Abstract


We study the computational and sample complexity of parameter and structure learning in graphical 
models. Our main result shows that the class of factor graphs with bounded degree can be learned 
in polynomial time and from a polynomial number of training examples, assuming that the data 
is generated by a network in this class. This result covers both parameter estimation for a known 
network structure and structure learning. It implies as a corollary that we can learn factor graphs for 
both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample 
complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method 
does not require inference in the underlying network, and so applies to networks where inference 
is intractable. We also show that the error of our learned model degrades gracefully when the 
generating distribution is not a member of the target class of networks. In addition to our main 
result, we show that the sample complexity of parameter learning in graphical models has an O(1) 
dependence on the number of variables in the model when using the KL-divergence normalized by 
the number of variables as the performance criterion.¹

Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov 
networks, Bayesian networks 


1. Introduction 


Graphical models are widely used to compactly represent structured probability distributions over 
(large) sets of random variables. Learning a graphical model from data is important for many 
applications. This learning problem can vary along several axes, including whether the data is fully 
or partially observed, and whether the structure of the network is given or needs to be learned from 
data. 

In this paper, we focus on the problem of learning both network structure and parameters from 
fully observable data, restricting attention to discrete probability distributions over finite sets. We 
focus on the problem of learning a factor graph representation (Kschischang et al., 2001) of the 
distribution. Factor graphs subsume both Bayesian networks and Markov networks, in that every 
Bayesian network or Markov network can be written as a factor graph of (essentially) the same 
size.²





1. A preliminary version of some of this work was reported in Abbeel et al. (2005). 
2. The factor graph corresponding to either a Bayesian network or a Markov network can be constructed in linear time 
(as a function of the size of the original network). See, for example, Kschischang et al. (2001), and Yedidia et al. 
We provide a new parameterization of factor graph distributions, which forms the basis for
our results. In this new parameterization, every factor is written as a product of probabilities over
the variables in the factor and its neighbors. We will refer to such subsets of variables as “local
subsets of variables.” These local subsets of variables are of size at most $d^3$ for factor graphs
of bounded degree d. Thus, for factor graphs of bounded degree d, the probabilities appearing
in our new parameterization are over at most $d^3$ variables and can be estimated efficiently from
training examples.³ Hence this new parameterization naturally leads to an algorithm that solves the
parameter learning problem in closed-form by estimating the probabilities over these local subsets 
of variables from training examples. We show that our closed-form estimation procedure results in 
a good estimate of the true distribution. More specifically, for factor graphs of bounded degree, if 
the generating distribution falls into the target class, we show that our estimation procedure returns 
an accurate solution—one of low KL-divergence from the true distribution—given a polynomial 
number of training examples. 

In contrast to our new parameterization, the factors in a factor graph (or a Markov network) are 
typically considered to have no probabilistic interpretation at all. One exception is the canonical 
parameterization used in the Hammersley-Clifford theorem for Markov networks (Hammersley and 
Clifford, 1971; Besag, 1974b). The Hammersley-Clifford canonical parameterization expresses the 
distribution as a product of probabilities over all variables. However, the number of different in- 
stantiations is exponential in the number of variables. Therefore such probabilities over all variables 
cannot be estimated accurately from a small number of training examples. As a consequence the 
Hammersley-Clifford canonical parameterization is not suited for parameter learning. 

Our closed-form parameter learning algorithm is the first polynomial-time and polynomial 
sample-complexity parameter learning algorithm for factor graphs of bounded degree, and thereby 
for Markov networks of bounded degree. In contrast, we do not know how to do maximum like- 
lihood (ML) estimation in Markov networks or factor graphs without evaluating the likelihood. 
Evaluating the likelihood is equivalent to evaluating the partition function. Evaluating the parti- 
tion function is known to be NP-hard, both exactly and approximately (Jerrum and Sinclair, 1993; 
Barahona, 1982). Indeed, all known exact algorithms grow exponentially in the tree-width of the 
graph, making the computation of the partition function intractable for many, even moderately sized, 
factor graphs. (See, for example, Cowell et al., 1999, for more details on such exact algorithms.) 
For example, n by n grids over binary variables (which have degree bounded by 4, independently 
of n) have tree-width n and the computational complexity of known algorithms for computing the 
partition function (and thus of known ML algorithms) is $O(2^n)$.

We analyze the sample complexity of parameter learning as a function of the number of variables 
in the network. We show that (under some mild assumptions) the sample complexity of parameter 
learning in graphical models has an O(1) dependence on the number of variables in the graphical
model when using KL-divergence normalized by the number of variables as the performance crite- 
rion. This result is important since it gives theoretical support for the common practice of learning 
large graphical models from a relatively small number of training examples. More specifically, the 
number of training examples can be much smaller than the number of parameters when learning 
large graphical models. 





(2001), for more details on the equivalence and conversion between factor graphs, Bayesian networks and Markov 
networks. 

3. For a pairwise Markov network with degree of the undirected graph bounded by d, the local subsets are of size at 
most 2d. 
Building on our closed-form parameter learning algorithm, we provide an algorithm for learning
not only the parameters, but also the structure. In our new parameterization, factors that are not 
present in the distribution can be computed in the same way from local probabilities as factors that 
are present in the distribution. As will become clear later, a key property of our new parameterization 
is that the factors not present in the distribution have all entries equal to one. This gives a very 
simple test to decide whether or not a factor is present in the distribution. Thus no iterative search 
procedure (as is common for most structure learning algorithms) is needed. However, to compute
a factor from local probabilities, we need to know which variables are its neighbors. So to
complete the structure learning algorithm, we need to show how to find each factor’s neighbors.
We show that local independence tests can be used to find the neighbors of each factor. Since local 
independence tests use statistics over a small number of variables only, the neighbors can be found 
efficiently from a small number of training examples. 


Our structure learning algorithm provides the first polynomial-time and polynomial sample- 
complexity structure learning algorithm for factor graphs, and thereby for Markov networks. Note 
that our algorithm applies to any factor graph of bounded degree, including those (such as grids) 
where inference is intractable. 


We also show that our algorithms degrade gracefully, in that they return reasonable answers 
even when the underlying distribution does not come exactly from the target class of networks. 


We note that the proposed algorithms are unlikely to be useful in practice in their current form. 
The structure learning algorithm does an exhaustive enumeration over the possible neighbor sets of 
factors in the factor graph, a process which is—although polynomial—generally infeasible even in 
moderately sized networks. Neither the parameter learning algorithm nor the structure learning algorithm
makes good use of all the available data. Nevertheless, the techniques used in our analysis open new
avenues towards efficient parameter and structure learning in undirected, intractable models. 


The remainder of this paper is organized as follows. Section 2 provides necessary background 
about Gibbs distributions, the factor graph associated with a Gibbs distribution, Markov blankets 
and the Hammersley-Clifford canonical parameterization. In its original form, the Hammersley- 
Clifford theorem applies to Markov networks only. We provide an extension that applies to factor 
graphs. In Section 3, building on the canonical parameterization for factor graphs, we derive our 
novel parameterization, which forms the basis of our parameter estimation algorithm. We present 
our algorithm and provide formal running time and sample complexity guarantees. We conclude the 
section with an in-depth analysis of the relationship between the sample complexity and the number 
of random variables. In Section 4, we present our structure learning algorithm, and its formal 
guarantees. Section 5 discusses related work. For clarity of exposition, we provide the complete 
proofs of all theorems and propositions in the appendix. 


Table 1 gives an overview of the notation we use throughout this paper. 


2. Preliminaries 


In this section we first introduce Gibbs distributions, the factor graph associated with a Gibbs distri- 
bution, Markov blankets and the canonical parameterization. Then we present an extension of the 
Hammersley-Clifford theorem—which in its original form only applies to Markov networks—to 
factor graphs. Throughout the paper we restrict attention to discrete probability distributions over 
finite sets. 

[Figure 1: Example factor graph. Squares denote factor nodes; circles denote variable nodes.]


2.1 Gibbs Distributions 


The probability distributions we consider are referred to as Gibbs distributions. 


Definition 1 (Gibbs distribution) A factor f with scope⁴ D is a mapping from val(D) to $\mathbb{R}^+$. A
Gibbs distribution P over a set of random variables $\mathcal{X} = \{X_1, \ldots, X_n\}$ is associated with a set of
factors $\{f_j\}_{j=1}^{J}$ with scopes $\{C_j\}_{j=1}^{J}$, such that

$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{j=1}^{J} f_j(C_j[X_1, \ldots, X_n]).$$


The normalizing constant Z is the partition function. 
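
As a minimal sketch of Definition 1, the following Python snippet builds a Gibbs distribution by brute-force enumeration. The function name `gibbs_distribution`, the toy factor table `agree`, and the binary domain are illustrative assumptions rather than anything specified in the paper; the enumeration is exponential in the number of variables and is only viable for tiny examples.

```python
import itertools


def gibbs_distribution(num_vars, factors, domain=(0, 1)):
    """Return ({assignment: probability}, Z) for a Gibbs distribution.

    factors: list of (scope, table) pairs; scope is a tuple of variable
    indices and table maps a tuple of values on that scope to a positive
    number.
    """
    unnormalized = {}
    for x in itertools.product(domain, repeat=num_vars):
        weight = 1.0
        for scope, table in factors:
            weight *= table[tuple(x[i] for i in scope)]
        unnormalized[x] = weight
    Z = sum(unnormalized.values())  # the partition function
    return {x: w / Z for x, w in unnormalized.items()}, Z


# Hypothetical toy factors over three binary variables (scopes {0,1}, {1,2}).
agree = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
factors = [((0, 1), agree), ((1, 2), agree)]
P, Z = gibbs_distribution(3, factors)
```

The sum that produces `Z` ranges over every joint assignment, which is exactly the cost that the closed-form estimator discussed in this paper avoids.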


The factor graph associated with a Gibbs distribution is a bipartite graph whose nodes correspond
to variables and factors, with an edge between a variable X and a factor $f_j$ if the scope of $f_j$
contains X. There is a one-to-one correspondence between factor graphs and the sets of scopes. Figure 1
gives an example of a factor graph. Here the Gibbs distribution is over the variables $X_1, \ldots, X_9$,
which are represented by circles in the factor graph. The factors are represented by squares and
have the following respective scopes: $\{X_1,X_2,X_3\}$, $\{X_1,X_2\}$, $\{X_2,X_3\}$, $\{X_1,X_4\}$, $\{X_2,X_5\}$, $\{X_3,X_6\}$,
$\{X_4,X_5\}$, $\{X_5,X_6\}$, $\{X_4,X_7\}$, $\{X_5,X_8\}$, $\{X_6,X_9\}$, $\{X_7,X_8\}$, $\{X_8,X_9\}$. The corresponding Gibbs
distribution is given by

$$P(X_1 = x_1, \ldots, X_9 = x_9) = \frac{1}{Z}\, f_{\{X_1,X_2,X_3\}}(x_1,x_2,x_3)\, f_{\{X_1,X_2\}}(x_1,x_2) \cdots f_{\{X_8,X_9\}}(x_8,x_9).$$


A Gibbs distribution also induces a Markov network—an undirected graph whose nodes corre- 
spond to the random variables X and where there is an edge between two variables if there is a factor 
in which they both participate. The set of scopes uniquely determines the structure of the Markov 
network, but several different sets of scopes can result in the same Markov network. For example, a 
fully connected Markov network can correspond both to a Gibbs distribution with $\binom{n}{2}$ factors over
pairs of variables, and to a distribution with a factor which is a joint distribution over X. We will





4. A function has scope X if its domain is val(X), the set of possible instantiations of the set of random variables X. 
use the more precise factor graph representation in this paper. Our results are easily translated into
results for Markov networks. 


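Definition 2 (Markov blanket) Let a set of scopes $\mathcal{C} = \{C_j\}_{j=1}^{J}$ be given. The Markov blanket of
a set of random variables $D \subseteq \mathcal{X}$ is defined as

$$\mathrm{MB}(D) = \bigcup \{C_j : C_j \in \mathcal{C},\ C_j \cap D \neq \emptyset\} - D.$$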


Thus, the Markov blanket of a set of variables D is the minimal set of variables that separates D from
the other variables in the factor graph. For the factor graph distribution of Figure 1 we have, for
example, $\mathrm{MB}(\{X_1\}) = \{X_2,X_3,X_4\}$, $\mathrm{MB}(\{X_1,X_2\}) = \{X_3,X_4,X_5\}$, and $\mathrm{MB}(\{X_5\}) = \{X_2,X_4,X_6,X_8\}$.
For any Gibbs distribution, we have, for any subset of random variables D, that

$$D \perp \mathcal{X} - D - \mathrm{MB}(D) \mid \mathrm{MB}(D), \qquad (1)$$

or in words: given its Markov blanket MB(D), the set of variables D is independent of all other
variables $\mathcal{X} - D - \mathrm{MB}(D)$.⁵
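
The Markov blanket definition translates directly into a few lines of code. The sketch below is illustrative only: the function name `markov_blanket` is hypothetical, and the scope list encodes the Figure 1 example as read from the text (one triple plus the edges of a 3-by-3 grid, variables numbered 1 to 9), which is an assumption about the figure.

```python
def markov_blanket(scopes, D):
    """MB(D): union of all scopes that intersect D, minus D itself."""
    D = set(D)
    blanket = set()
    for scope in scopes:
        if D & set(scope):
            blanket |= set(scope)
    return blanket - D


# Assumed scopes of the Figure 1 example (variables numbered 1..9).
scopes = [(1, 2, 3), (1, 2), (2, 3), (1, 4), (2, 5), (3, 6), (4, 5), (5, 6),
          (4, 7), (5, 8), (6, 9), (7, 8), (8, 9)]

mb_1 = sorted(markov_blanket(scopes, {1}))
mb_12 = sorted(markov_blanket(scopes, {1, 2}))
mb_5 = sorted(markov_blanket(scopes, {5}))
```

Under these assumed scopes the three results agree with the Markov blankets quoted in the text for {X1}, {X1, X2} and {X5}.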

A standard assumption for a Gibbs distribution, which is critical for identifying its structure 
(see Lauritzen, 1996, Ch. 3), is that the distribution be positive—all of its entries be non-zero. Our 
results use a quantitative measure for how positive P is. Let $\gamma = \min_{\mathbf{x},i} P(X_i = x_i \mid \mathcal{X}_{-i} = \mathbf{x}_{-i})$, where
the −i subscript denotes all entries but entry i. Note that, if we have a fixed bound on the number
of factors in which a variable can participate, a fixed bound on the domain size for each variable,
and a fixed bound on how skewed each factor is (more specifically a bound on the ratio of its
lowest and highest entries), we are guaranteed a bound on $\gamma$ that is independent of the number n
of variables in the network. Thus, under these assumptions, our sample complexity results, which
are expressed as a function of $\gamma$, have no hidden dependence on the number of variables n. In
contrast, $\tilde{\gamma} = \min_{\mathbf{x}} P(\mathcal{X} = \mathbf{x})$ generally has an exponential dependence on n. For example, if we
have n independent and identically distributed (i.i.d.) Bernoulli($\frac{1}{2}$) random variables, then $\gamma = \frac{1}{2}$
(independent of n) but $\tilde{\gamma} = \frac{1}{2^n}$.
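
To make the contrast concrete, the following sketch computes both quantities by enumeration for n i.i.d. Bernoulli variables. It assumes fair coins (p = 1/2) by default; the function name `gamma_and_min_joint` is hypothetical and the brute-force loop is for illustration only.

```python
import itertools


def gamma_and_min_joint(n, p=0.5):
    """For n i.i.d. Bernoulli(p) variables, return (gamma, min_x P(x))."""
    def joint(x):
        prob = 1.0
        for xi in x:
            prob *= p if xi == 1 else 1.0 - p
        return prob

    gamma = 1.0
    min_joint = 1.0
    for x in itertools.product((0, 1), repeat=n):
        px = joint(x)
        min_joint = min(min_joint, px)
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            # P(X_i = x_i | all other variables), from two joint probabilities.
            gamma = min(gamma, px / (px + joint(flipped)))
    return gamma, min_joint


gamma_4, min_joint_4 = gamma_and_min_joint(4)
gamma_8, min_joint_8 = gamma_and_min_joint(8)
```

The per-variable conditional minimum stays fixed as n grows, while the smallest joint probability halves with every added variable.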


2.2 The Canonical Parameterization 


A Gibbs distribution is generally over-parameterized relative to the structure of the underlying fac- 
tor graph, in that a continuum of possible parameterizations over the graph can all encode the same 
distribution. The canonical parameterization (Hammersley and Clifford, 1971; Besag, 1974b) pro- 
vides one specific choice of parameterization for a Gibbs distribution, with some nice properties 
(see below). The canonical parameterization forms the basis for the Hammersley-Clifford theorem, 
which asserts that any distribution that satisfies the independence assumptions encoded by a Markov 
network can be represented as a Gibbs distribution with factors corresponding to each of the cliques 
in the Markov network. In its original formulation, the canonical distribution is defined for Gibbs 
distributions over Markov networks. We use a more refined parameterization, defined at the factor 
level; results at the clique level (or, equivalently, results for Markov networks) are trivial corollaries. 

The canonical parameterization is defined relative to an arbitrary (but fixed) set of “default”
assignments $\bar{\mathbf{x}} = (\bar{x}_1, \ldots, \bar{x}_n)$. Let any subset of variables $D = (X_{i_1}, \ldots, X_{i_{|D|}})$, and any assignment





5. By X L Y we denote that X is independent of Y. By X L Y | Z we denote that X is conditionally independent of Y 
given Z. 




ABBEEL, KOLLER AND NG 


$d = (x_{i_1}, \ldots, x_{i_{|D|}})$ be given. Let any $U \subseteq D$ be given. We define $\sigma_U[\cdot]$ such that for all $i \in \{1, \ldots, n\}$:

$$(\sigma_U[d])_i = \begin{cases} x_i & \text{if } X_i \in U, \\ \bar{x}_i & \text{if } X_i \notin U. \end{cases}$$

In words, $\sigma_U[d]$ keeps the assignments to the variables in U as specified in d, and augments it to
form a full assignment using the default values in $\bar{\mathbf{x}}$. Note that the assignments to variables outside
U are always ignored, and replaced with their default values. Thus, the scope of $\sigma_U[\cdot]$ is always U.
Let P be a positive Gibbs distribution. The canonical factor for $D \subseteq \mathcal{X}$ is defined as follows:

$$f^*_D(d) = \exp\Big( \sum_{U \subseteq D} (-1)^{|D - U|} \log P(\sigma_U[d]) \Big). \qquad (2)$$

The sum is over all subsets of D, including D itself and the empty set $\emptyset$.
The following theorem extends the Hammersley-Clifford theorem (which applies to Markov 
networks) to factor graphs. 


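Theorem 3 Let P be a positive Gibbs distribution with factor scopes $\{C_j\}_{j=1}^{J}$. Let $\{C^*_j\}_{j=1}^{J^*} = \bigcup_{j=1}^{J} 2^{C_j} - \emptyset$
(where $2^X$ is the power set of X, that is, the set of all of its subsets). Then

$$P(\mathbf{x}) = P(\bar{\mathbf{x}}) \prod_{j=1}^{J^*} f^*_{C^*_j}(\mathbf{c}^*_j),$$

where $\mathbf{c}^*_j$ is the instantiation of $C^*_j$ consistent with $\mathbf{x}$.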


The proof is in the appendix. 

The parameterization of P using the canonical factors $\{f^*_{C^*_j}\}_{j=1}^{J^*}$ is called the canonical parameterization
of P. Although typically $J^* > J$, the additional factors are all subfactors of the original
factors. Note that first transforming a factor graph into a Markov network and then applying the 
Hammersley-Clifford theorem to the Markov network generally results in a significantly less sparse 
canonical parameterization than the canonical parameterization from Theorem 3. 
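
The canonical factors and the product form of Theorem 3 can be checked numerically on a small case. The sketch below is an illustration under stated assumptions: the three-variable distribution, the factor tables `f01` and `f12`, and all function names are hypothetical, and the joint table is built by enumeration only so that the identity can be verified.

```python
import itertools
import math


def powerset(items):
    items = tuple(items)
    return [c for r in range(len(items) + 1)
            for c in itertools.combinations(items, r)]


def canonical_log_factor(P, D, x, default):
    """log f*_D at the values x takes on D, following Eqn. (2)."""
    total = 0.0
    for U in powerset(D):
        # sigma_U[d]: keep x on U, use the default value everywhere else.
        sigma = tuple(x[i] if i in U else default[i] for i in range(len(x)))
        total += (-1) ** (len(D) - len(U)) * math.log(P[sigma])
    return total


# Hypothetical positive Gibbs distribution with scopes {0, 1} and {1, 2}.
f01 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}
f12 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.5}
scopes = [(0, 1), (1, 2)]
weights = {x: f01[x[0], x[1]] * f12[x[1], x[2]]
           for x in itertools.product((0, 1), repeat=3)}
Z = sum(weights.values())
P = {x: w / Z for x, w in weights.items()}

default = (0, 0, 0)
# {C*_j}: all non-empty subsets of the original scopes.
canonical_scopes = sorted({s for c in scopes for s in powerset(c) if s})


def reconstruct(x):
    """P(default) times the product of canonical factors, as in Theorem 3."""
    log_p = math.log(P[default])
    for D in canonical_scopes:
        log_p += canonical_log_factor(P, D, x, default)
    return math.exp(log_p)


max_error = max(abs(reconstruct(x) - P[x]) for x in P)
# A subset that is not contained in any scope yields a factor of all ones.
absent = [canonical_log_factor(P, (0, 2), x, default) for x in P]
```

In this toy example the reconstruction error is at floating-point level, and the canonical factor over {X0, X2}, which is not a subset of any scope, has log-entries equal to zero up to rounding.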

We now give an example to clarify the definition of canonical factors and canonical parameter- 
ization. 


Example 1 Consider again the factor graph of Figure 1. Assume we take the fixed assignment to
be all zeros, namely we have $\bar{x}_1 = 0, \bar{x}_2 = 0, \ldots, \bar{x}_9 = 0$. Then the canonical factor $f^*_{\{X_1,X_2\}}$ over the
variables $X_1, X_2$ instantiated to $x_1, x_2$ is given by

$$\begin{aligned}
\log f^*_{\{X_1,X_2\}}(x_1,x_2) ={}& \log P(X_1 = x_1, X_2 = x_2, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = 0, X_2 = x_2, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = x_1, X_2 = 0, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& + \log P(X_1 = 0, X_2 = 0, X_3 = 0, X_4 = 0, \ldots, X_9 = 0). \qquad (3)
\end{aligned}$$


So to compute the canonical factor, we start with the joint instantiation of the factor variables
$\{X_1, X_2\}$ with all other variables $\{X_3, \ldots, X_9\}$ set to their default instantiations. Then we subtract
out the instantiations for which one of the factor variables is changed to its default instantiation. 
Crudely speaking, we subtract out the interactions that are already captured by a canonical factor 
over a smaller set of variables. Then we adjust for double counting by adding back in the instanti- 
ation where both factor variables have been set to their default instantiation. 
Similarly, the canonical factor $f^*_{\{X_1,X_2,X_3\}}$ over the variables $X_1, X_2, X_3$ instantiated to $x_1, x_2, x_3$ is
given by

$$\begin{aligned}
\log f^*_{\{X_1,X_2,X_3\}}(x_1,x_2,x_3) ={}& \log P(X_1 = x_1, X_2 = x_2, X_3 = x_3, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = 0, X_2 = x_2, X_3 = x_3, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = x_1, X_2 = 0, X_3 = x_3, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = x_1, X_2 = x_2, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& + \log P(X_1 = 0, X_2 = 0, X_3 = x_3, X_4 = 0, \ldots, X_9 = 0) \\
& + \log P(X_1 = 0, X_2 = x_2, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& + \log P(X_1 = x_1, X_2 = 0, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = 0, X_2 = 0, X_3 = 0, X_4 = 0, \ldots, X_9 = 0).
\end{aligned}$$


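The canonical factor over just the variable $X_1$ instantiated to $x_1$ is given by

$$\begin{aligned}
\log f^*_{\{X_1\}}(x_1) ={}& \log P(X_1 = x_1, X_2 = 0, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = 0, X_2 = 0, X_3 = 0, X_4 = 0, \ldots, X_9 = 0).
\end{aligned}$$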

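Theorem 3 applied to our example gives the following expression for the probability distribution:

$$\begin{aligned}
P(X_1 = x_1, \ldots, X_9 = x_9) ={}& P(X_1 = 0, \ldots, X_9 = 0) \\
& \times f^*_{\{X_1,X_2,X_3\}}(x_1,x_2,x_3) \\
& \times f^*_{\{X_1,X_2\}}(x_1,x_2)\, f^*_{\{X_2,X_3\}}(x_2,x_3) \cdots f^*_{\{X_8,X_9\}}(x_8,x_9) \\
& \times f^*_{\{X_1\}}(x_1)\, f^*_{\{X_2\}}(x_2) \cdots f^*_{\{X_9\}}(x_9) \\
={}& \frac{1}{Z} \\
& \times f^*_{\{X_1,X_2,X_3\}}(x_1,x_2,x_3) \\
& \times f^*_{\{X_1,X_2\}}(x_1,x_2)\, f^*_{\{X_2,X_3\}}(x_2,x_3) \cdots f^*_{\{X_8,X_9\}}(x_8,x_9) \\
& \times f^*_{\{X_1\}}(x_1)\, f^*_{\{X_2\}}(x_2) \cdots f^*_{\{X_9\}}(x_9). \qquad (4)
\end{aligned}$$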


3. Parameter Estimation 


In this section we first introduce the parameter estimation ideas informally by expanding on Ex- 
ample 1. Then we formally introduce the key idea of Markov blanket canonical factors, which 
give a parameterization of a factor graph distribution only in terms of local probabilities. This new 
parameterization directly results in the proposed parameter estimation algorithm. We analyze the 
algorithm’s computational and sample complexity. In addition, we show an O(1) dependence on 
the number of variables in the network for the sample complexity when using the KL-divergence 
normalized by the number of variables in the network as performance criterion. 


3.1 Parameter Estimation by Example 


Consider the problem of estimating the parameters of the distribution in Figure 1 from training 
examples. From Eqn. (4) we have that it is sufficient to estimate all the canonical factors. Each 
canonical factor is expressed in terms of probabilities. So one could estimate the canonical factors 
(and thus the distribution) in closed-form by estimating these probabilities from data. Unfortunately
the probabilities appearing in the canonical factors are over full joint instantiations of all variables.
As a consequence, these probabilities cannot be estimated accurately from a small amount of data.

However, we will now consider the factor $f^*_{\{X_1,X_2\}}$ more carefully and show it can be estimated
from probabilities over small subsets of the variables only. The factor $f^*_{\{X_1,X_2\}}$ contains an equal
number of terms with positive and negative sign. For the sum of two such terms (one of each sign), we now
derive a novel expression which contains local probabilities only (instead of probabilities of full joint
instantiations of all variables).


$$\begin{aligned}
& \log P(X_1 = x_1, X_2 = x_2, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& \quad - \log P(X_1 = x_1, X_2 = 0, X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
={}& \log P(X_1 = x_1, X_2 = x_2 \mid X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& + \log P(X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = x_1, X_2 = 0 \mid X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
={}& \log P(X_1 = x_1, X_2 = x_2 \mid X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
& - \log P(X_1 = x_1, X_2 = 0 \mid X_3 = 0, X_4 = 0, \ldots, X_9 = 0) \\
={}& \log P(X_1 = x_1, X_2 = x_2 \mid \mathrm{MB}(\{X_1,X_2\}) = 0) \\
& - \log P(X_1 = x_1, X_2 = 0 \mid \mathrm{MB}(\{X_1,X_2\}) = 0) \\
={}& \log P(X_1 = x_1, X_2 = x_2 \mid X_3 = 0, X_4 = 0, X_5 = 0) \\
& - \log P(X_1 = x_1, X_2 = 0 \mid X_3 = 0, X_4 = 0, X_5 = 0). \qquad (5)
\end{aligned}$$

Here we used in order: the definition of conditional probability; same terms with opposite sign 
cancel; conditioning on the Markov blanket is equivalent to conditioning on all other variables; 
MB({X1,X2}) = {X3,X4,X5} in our example. 

The last expression in Eqn. (5) contains local probabilities only, which can be estimated accurately
from a small number of training examples. Using similar reasoning for the other
two terms of the factor $f^*_{\{X_1,X_2\}}$, we get the following expression for $f^*_{\{X_1,X_2\}}$, which contains local
probabilities only:

$$\begin{aligned}
\log f^*_{\{X_1,X_2\}}(x_1,x_2) ={}& \log P(X_1 = x_1, X_2 = x_2 \mid X_3 = 0, X_4 = 0, X_5 = 0) \\
& - \log P(X_1 = x_1, X_2 = 0 \mid X_3 = 0, X_4 = 0, X_5 = 0) \\
& - \log P(X_1 = 0, X_2 = x_2 \mid X_3 = 0, X_4 = 0, X_5 = 0) \\
& + \log P(X_1 = 0, X_2 = 0 \mid X_3 = 0, X_4 = 0, X_5 = 0) \\
={}& \log f^*_{\{X_1,X_2\} \mid \{X_3,X_4,X_5\}}(x_1,x_2). \qquad (6)
\end{aligned}$$

The last line defines $f^*_{\{X_1,X_2\} \mid \{X_3,X_4,X_5\}}(x_1,x_2)$ (which we refer to as the Markov blanket canonical
factor for $\{X_1,X_2\}$). Although $f^*_{\{X_1,X_2\}}(x_1,x_2) = f^*_{\{X_1,X_2\} \mid \{X_3,X_4,X_5\}}(x_1,x_2)$ when exact probabilities are
used, we use different notation to explicitly distinguish how they are computed from probabilities.
The Markov blanket canonical factor $f^*_{\{X_1,X_2\} \mid \{X_3,X_4,X_5\}}(x_1,x_2)$ is computed from local probabilities
as given in Eqn. (6). The (original) canonical factor $f^*_{\{X_1,X_2\}}(x_1,x_2)$ is computed from probabilities
over full joint instantiations as given in Eqn. (3).
Similarly, the other canonical factors have equivalent Markov blanket canonical factors which
involve local probabilities only. This gives us an efficient closed-form parameter estimation algo- 
rithm for our example. In the next few sections we formalize this idea for general factor graphs and 
analyze the computational and sample complexity. 


3.2 Markov Blanket Canonical Factors 


Considering the definition of the canonical parameters, we note that all of the terms in Eqn. (2) can 
be estimated from empirical data using simple counts, without requiring inference over the network. 
Thus, it appears that we can use the canonical parameterization as the basis for our parameter 
estimation algorithm. However, as written, this estimation process is statistically infeasible, as 
the terms in Eqn. (2) are probabilities over full instantiations of all variables, which can never be 
estimated from a reasonable number of training examples. 

We now generalize our observation from the example in the previous section: namely, that we 
can express the canonical factors using only probabilities over much smaller instantiations—those 
corresponding to a factor and its Markov blanket. Let $D = (X_{i_1}, \ldots, X_{i_{|D|}})$ be any subset of variables,
and $d = (x_{i_1}, \ldots, x_{i_{|D|}})$ be any assignment to D. For any $U \subseteq D$, we define $\sigma_{U:D}[d]$ to be the restriction
of the full instantiation $\sigma_U[d]$ of all variables in $\mathcal{X}$ to the corresponding instantiation of the subset D.
In other words, $\sigma_{U:D}[d]$ keeps the assignments to the variables in U as specified in d, and changes
the assignment to the variables in D − U to the default values in $\bar{\mathbf{x}}$. Let $D \subseteq \mathcal{X}$ and $Y \subseteq \mathcal{X} - D$. Then
the factor $f^*_{D|Y}$ over the variables in D is defined as follows:

$$f^*_{D|Y}(d) = \exp\Big( \sum_{U \subseteq D} (-1)^{|D - U|} \log P(\sigma_{U:D}[d] \mid Y = \bar{\mathbf{y}}) \Big), \qquad (7)$$


where the sum is over all subsets of D, including D itself and the empty set $\emptyset$.

For example, we have that $f^*_{\{X_1,X_2\} \mid \{X_3,X_4,X_5\}}$ of the factor graph in Figure 1 is given by Eqn. (6)
in the previous section.

The following proposition shows an equivalence between the factors computed using Eqn. (2)
and Eqn. (7).


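Proposition 4 Let P be a positive Gibbs distribution with factor scopes $\{C_j\}_{j=1}^{J}$ and $\{C^*_j\}_{j=1}^{J^*}$ as
above (i.e., $\{C^*_j\}_{j=1}^{J^*} = \bigcup_{j=1}^{J} 2^{C_j} - \emptyset$). Then for any $D \subseteq \mathcal{X}$, we have:

$$f^*_D = f^*_{D \mid \mathcal{X} - D} = f^*_{D \mid \mathrm{MB}(D)}, \qquad (8)$$

and (as a direct consequence)

$$\begin{aligned}
P(\mathbf{x}) &= P(\bar{\mathbf{x}}) \prod_{j=1}^{J^*} f^*_{C^*_j \mid \mathcal{X} - C^*_j}(\mathbf{c}^*_j) \qquad (9) \\
&= P(\bar{\mathbf{x}}) \prod_{j=1}^{J^*} f^*_{C^*_j \mid \mathrm{MB}(C^*_j)}(\mathbf{c}^*_j), \qquad (10)
\end{aligned}$$

where $\mathbf{c}^*_j$ is the instantiation of $C^*_j$ consistent with $\mathbf{x}$.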


Proposition 4 shows that we can compute the canonical parameterization factors using probabilities 
over factor scopes and their Markov blankets only. From a sample complexity point of view, this 
is a significant improvement over the standard definition which uses joint instantiations over all 
variables. Using Eqn. (7) we can expand the Markov blanket canonical factors in Proposition 4 and 
we see that any factor graph distribution can be parameterized as a product of local probabilities 
only. 
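
The equivalence in Eqn. (8) can be illustrated numerically. The sketch below is a hedged example on a hypothetical four-variable chain with a made-up pairwise table `pair`; the helper names `marginal` and `blanket_log_factor` are illustrative, and marginals are obtained by summing a full joint table only because the example is tiny.

```python
import itertools
import math


def marginal(P, assignment):
    """P(X_i = v for all (i, v) in assignment), by summing the joint table."""
    return sum(p for x, p in P.items()
               if all(x[i] == v for i, v in assignment.items()))


def blanket_log_factor(P, D, Y, d, default):
    """log f*_{D|Y}(d) from Eqn. (7); d maps each variable in D to a value."""
    y_bar = {i: default[i] for i in Y}
    total = 0.0
    for r in range(len(D) + 1):
        for U in itertools.combinations(D, r):
            # sigma_{U:D}[d]: values from d on U, defaults on D - U.
            local = {i: (d[i] if i in U else default[i]) for i in D}
            cond = marginal(P, {**local, **y_bar}) / marginal(P, y_bar)
            total += (-1) ** (len(D) - r) * math.log(cond)
    return total


# Hypothetical chain X0 - X1 - X2 - X3 with identical pairwise factors.
pair = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}
weights = {x: pair[x[0], x[1]] * pair[x[1], x[2]] * pair[x[2], x[3]]
           for x in itertools.product((0, 1), repeat=4)}
Z = sum(weights.values())
P = {x: w / Z for x, w in weights.items()}
default = (0, 0, 0, 0)

D = (0, 1)
d = {0: 1, 1: 1}
given_all_others = blanket_log_factor(P, D, (2, 3), d, default)  # Y = X - D
given_blanket = blanket_log_factor(P, D, (2,), d, default)       # Y = MB(D)
```

In this chain the Markov blanket of {X0, X1} is {X2}, so the second call conditions on a single variable yet yields the same log-factor as conditioning on every remaining variable.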

Symbol                                      Meaning
$X, Y, \ldots$                              random variables
$x, y, \ldots$                              instantiations of the random variables
$\mathbf{X}, \mathbf{Y}, \ldots$            sets of random variables
$\mathbf{x}, \mathbf{y}, \ldots$            instantiations of sets of random variables
$\mathrm{val}(X)$                           set of values the variable X can take
$D[\mathbf{x}]$                             instantiation of D consistent with x (abbreviated as d when no ambiguity is possible)
$X \perp Y$                                 X is independent of Y
$X \perp Y \mid Z$                          X is conditionally independent of Y given Z
$f$                                         factor
$P$                                         positive Gibbs distribution over a set of random variables $\mathcal{X} = (X_1, \ldots, X_n)$
$\{f_j\}_{j=1}^{J}$                         factors of P
$\{C_j\}_{j=1}^{J}$                         scopes of factors of P
$\hat{P}$                                   empirical (sample) distribution
$\tilde{P}$                                 distribution returned by learning algorithm
$f^*_D$                                     canonical factor as defined in Eqn. (2)
$f^*_{D|Y}$                                 canonical factor as defined in Eqn. (7)
$\hat{f}^*_{D|Y}$                           canonical factor as defined in Eqn. (7), but using the empirical distribution $\hat{P}$
$\mathrm{MB}(D)$                            Markov blanket of D
$k$                                         $\max_j |C_j|$
$\gamma$                                    $\min_{\mathbf{x},i} P(X_i = x_i \mid \mathcal{X}_{-i} = \mathbf{x}_{-i})$
$v$                                         $\max_i |\mathrm{val}(X_i)|$
$b$                                         $\max_j |\mathrm{MB}(C_j)|$
$m$                                         number of training examples
$D(P \| Q)$                                 KL-divergence, $D(P \| Q) = \sum_{\mathbf{x} \in \mathrm{val}(\mathcal{X})} P(\mathbf{x}) \log \frac{P(\mathbf{x})}{Q(\mathbf{x})}$
$\mathcal{C}$                               the set of candidate factor scopes for the structure learning algorithm, Factor-Graph-Structure-Learn ($\mathcal{C} = \{C^*_j : C^*_j \subseteq \mathcal{X},\ C^*_j \neq \emptyset,\ |C^*_j| \le k\}$)

Table 1: Notational conventions.


3.3 Parameter Estimation Algorithm 


Based on the parameterization above, we propose the following Factor-Graph-Parameter-Learn
algorithm. The algorithm takes as inputs: the scopes of the factors $\{C_j\}_{j=1}^{J}$, training examples
$\{\mathbf{x}^{(i)}\}_{i=1}^{m}$, and a baseline instantiation $\bar{\mathbf{x}}$. Then for $\{C^*_j\}_{j=1}^{J^*}$ as above (i.e., $\{C^*_j\}_{j=1}^{J^*} = \bigcup_{j=1}^{J} 2^{C_j} - \emptyset$),
Factor-Graph-Parameter-Learn does the following:

• Compute the estimates $\{\hat{f}^*_{C^*_j \mid \mathrm{MB}(C^*_j)}\}_{j=1}^{J^*}$ of the canonical factors as in Eqn. (7), but using the
  empirical estimates based on the training examples.

• Return the probability distribution $\tilde{P}(\mathbf{x}) \propto \prod_{j=1}^{J^*} \hat{f}^*_{C^*_j \mid \mathrm{MB}(C^*_j)}(\mathbf{c}^*_j)$.
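
The two steps above can be sketched in Python for a toy problem. This is an illustrative sketch, not the authors' implementation: the function names, the chain-structured example, and the sample size are hypothetical; the small pseudo-count `pseudo` is an added assumption to avoid taking the logarithm of a zero count; and the final brute-force normalization exists only so the result can be inspected on a tiny model.

```python
import itertools
import math
import random


def nonempty_subsets(scope):
    return [c for r in range(1, len(scope) + 1)
            for c in itertools.combinations(scope, r)]


def markov_blanket(scopes, D):
    touched = set().union(*(set(s) for s in scopes if set(s) & set(D)))
    return tuple(sorted(touched - set(D)))


def learn_parameters(scopes, samples, default, domain=(0, 1), pseudo=1e-3):
    """Return {D: {d: estimated log f*_{D|MB(D)}(d)}} from samples."""
    canonical_scopes = sorted({s for c in scopes for s in nonempty_subsets(c)})
    log_factors = {}
    for D in canonical_scopes:
        mb = markov_blanket(scopes, D)
        # Keep only samples whose Markov blanket takes the default values.
        rows = [x for x in samples if all(x[i] == default[i] for i in mb)]
        counts = {}
        for x in rows:
            key = tuple(x[i] for i in D)
            counts[key] = counts.get(key, 0) + 1
        denom = len(rows) + pseudo * len(domain) ** len(D)

        def cond(local):
            # Smoothed estimate of P(D = local | MB(D) = default).
            return (counts.get(local, 0) + pseudo) / denom

        table = {}
        for d in itertools.product(domain, repeat=len(D)):
            total = 0.0
            for r in range(len(D) + 1):
                for U in itertools.combinations(range(len(D)), r):
                    local = tuple(d[j] if j in U else default[D[j]]
                                  for j in range(len(D)))
                    total += (-1) ** (len(D) - r) * math.log(cond(local))
            table[d] = total
        log_factors[D] = table
    return log_factors


def normalized(log_factors, num_vars, domain=(0, 1)):
    """Brute-force normalization; feasible only for tiny illustrations."""
    weights = {}
    for x in itertools.product(domain, repeat=num_vars):
        weights[x] = math.exp(sum(table[tuple(x[i] for i in D)]
                                  for D, table in log_factors.items()))
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}


# Hypothetical chain X0 - X1 - X2 with known pairwise factors.
scopes = [(0, 1), (1, 2)]
pair = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
states = list(itertools.product((0, 1), repeat=3))
true_w = [pair[x[0], x[1]] * pair[x[1], x[2]] for x in states]
true_p = {x: w / sum(true_w) for x, w in zip(states, true_w)}

rng = random.Random(0)
samples = rng.choices(states, weights=true_w, k=20000)
learned = normalized(learn_parameters(scopes, samples, (0, 0, 0)), 3)
max_abs_error = max(abs(learned[x] - true_p[x]) for x in states)
```

Each estimate uses counts over a scope and its Markov blanket only, so no inference and no partition function is needed during learning; the normalization step here is a convenience for comparing against the known toy distribution.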
Theorem 5 (Parameter learning: computational complexity) The running time of the Factor-
Graph-Parameter-Learn algorithm is in $O(m 2^k J (k + b) + 2^{2k} J v^k)$.⁶


The proof is given in the appendix. 

Note the representation of the factor graph distribution is $\Omega(J v^k)$, thus exponential dependence
on k is unavoidable for any algorithm. More importantly, there is no dependence on the running time 
of evaluating the partition function. On the other hand, all currently known maximum likelihood 
estimation algorithms require evaluating the partition function, which is known to be NP-hard, both 
exactly and approximately (Jerrum and Sinclair, 1993; Barahona, 1982). 
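
To give a feel for the sizes involved, the sketch below counts the canonical scopes and their table entries for the Figure 1 example. The scope list (a triple plus the edges of a 3-by-3 grid) is as read from the text and is an assumption about the figure, as is the choice of binary variables (v = 2); the function name is hypothetical. Since each scope has at most $2^k - 1$ non-empty subsets, the number of canonical scopes is at most $2^k J$.

```python
import itertools


def canonical_scopes(scopes):
    """All non-empty subsets of the given scopes (the sets C*_j)."""
    return {frozenset(c) for s in scopes
            for r in range(1, len(s) + 1)
            for c in itertools.combinations(s, r)}


# Assumed scopes of the Figure 1 example (variables numbered 1..9).
scopes = [(1, 2, 3), (1, 2), (2, 3), (1, 4), (2, 5), (3, 6), (4, 5), (5, 6),
          (4, 7), (5, 8), (6, 9), (7, 8), (8, 9)]

J = len(scopes)
k = max(len(s) for s in scopes)
J_star = len(canonical_scopes(scopes))
v = 2  # assumption: binary variables
table_entries = sum(v ** len(c) for c in canonical_scopes(scopes))
```

The counts stay small here because the scopes are small; the exponential growth is in k, not in the number of variables.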


3.4 Sample Complexity 


We now analyze the sample complexity of the Factor-Graph-Parameter-Learn algorithm, showing 
that it returns a distribution that is a good approximation of the true distribution when given only a 
“small” number of training examples. We will use the sum of KL-divergences $D(P \| \tilde{P}) + D(\tilde{P} \| P)$
to measure how well the distribution $\tilde{P}$ approximates the distribution P.⁷
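
For reference, the performance measure can be written as a short function over explicit probability tables. This is a minimal sketch with hypothetical names and a made-up two-outcome example; it assumes both distributions are positive on the same support, as in the positivity assumption used throughout.

```python
import math


def kl(P, Q):
    """D(P||Q) for distributions given as {outcome: probability} tables."""
    return sum(p * math.log(p / Q[x]) for x, p in P.items() if p > 0)


def symmetric_kl(P, Q):
    """Sum of the two KL-divergences, D(P||Q) + D(Q||P)."""
    return kl(P, Q) + kl(Q, P)


# Hypothetical example distributions over a single binary outcome.
P = {0: 0.5, 1: 0.5}
Q = {0: 0.25, 1: 0.75}
value = symmetric_kl(P, Q)
```

The sum is symmetric in its arguments and is zero exactly when the two tables coincide.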


Theorem 6 (Parameter learning: sample complexity) Let any $\varepsilon, \delta > 0$ be given. Let Factor-Graph-
Parameter-Learn be given (a) m training examples $\{\mathbf{x}^{(i)}\}_{i=1}^{m}$ drawn i.i.d. from a distribution P and
(b) the factor graph structure according to which the distribution P factors. Let $\tilde{P}$ be the probability
distribution returned by Factor-Graph-Parameter-Learn. Then, we have that, for

$$D(P \| \tilde{P}) + D(\tilde{P} \| P) \le J\varepsilon$$

to hold with probability at least $1 - \delta$, it suffices that the number of training examples m satisfies:

$$m \ge \Big(1 + \frac{\varepsilon}{2^{2k+2}}\Big)^2 \frac{2^{4k+3}}{\gamma^{2k+2b}\,\varepsilon^2} \log \frac{2^{k+2} J v^{k+b}}{\delta}. \qquad (11)$$


A complete proof is given in the appendix. 

Theorem 6 shows that, assuming the true distribution P factors according to the given structure,
Factor-Graph-Parameter-Learn returns a distribution that is $J\varepsilon$-close in KL-divergence. The sample
complexity scales exponentially in the maximum number of variables per factor k, and polynomially
in $\frac{1}{\varepsilon}$ and $\frac{1}{\gamma}$.
The error in the KL-divergence grows linearly with the number of factors J. This is a
consequence of the fact that the number of terms in the distributions is equal to the number of
factors J, and each term can accrue an error. We can obtain a more refined analysis if we
eliminate this dependence by considering the KL-divergence normalized by the number of variables,
$D_n(P \| \tilde{P}) = \frac{1}{n} D(P \| \tilde{P})$. We return to this topic in Section 3.5.

We now sketch the proof idea. The Markov blanket canonical factors are a product of local
conditional probabilities. These local conditional probabilities can be estimated accurately from a
“small” number of training examples. Thus the Markov blanket canonical factors can be estimated
accurately from a small number of training examples. Thus the factor graph distribution, which is
just a product of the Markov blanket canonical factors, can be estimated accurately from a small
number of training examples.





6. The upper bound is based on a very naive implementation’s running time. It assumes that
operations on numbers (such as reading, writing, adding, etc.) take constant time.

7. $D(P \| Q) = \sum_{x \in val(X)} P(x) \log \frac{P(x)}{Q(x)}$.

Theorem 6 considers the case when P factors according to the given structure. The following
theorem shows that our error degrades gracefully even if the training examples are generated by a 
distribution Q that does not factor according to the given structure. 


Theorem 7 (Parameter learning: graceful degradation) Let any $\epsilon, \delta > 0$ be given. Let
$\{x^{(i)}\}_{i=1}^m$ be i.i.d. samples from a distribution $Q$. Let MB and $\widehat{MB}$ be the
Markov blankets according to the distribution $Q$ and the given structure respectively. Let
$\{f^*_{D_j^*|MB(D_j^*)}\}_j$ be the non-trivial Markov blanket canonical factors of $Q$ (those
factors with not all entries equal to one). Let $\{C_i^*\}_{i=1}^J$ be the scopes of the canonical
factors in the factor graph given to the algorithm. Let $\tilde{P}$ be the probability distribution
returned by Factor-Graph-Parameter-Learn. Then, for

\[ D(Q \| \tilde{P}) + D(\tilde{P} \| Q) \le J\epsilon
   + 2 \sum_{j : D_j^* \notin \{C_i^*\}_{i=1}^J} \max_{d_j^*} \left| \log f^*_{D_j^*|MB(D_j^*)}(d_j^*) \right|
   + 2 \sum_{j : MB(C_j^*) \ne \widehat{MB}(C_j^*)} \max_{c_j^*} \left| \log \frac{f^*_{C_j^*|MB(C_j^*)}(c_j^*)}{f^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)} \right| \]

to hold with probability at least $1 - \delta$, it suffices that the number of training examples $m$
satisfies Eqn. (11) of Theorem 6.


Note that the sample complexity depends on the parameters $k = \max_j |C_j^*|$ and
$b = \max_j |\widehat{MB}(C_j^*)|$ of the given target structure (rather than the true structure).
The graceful degradation result is important, as it shows that each canonical factor that is
incorrectly captured by our target structure adds at most a constant (namely,
$l 2^{l+1} \log \frac{1}{\gamma}$ for an incorrectly captured factor over $l$ variables) to our bound
on the KL-divergence.⁸ This constant can be large, so we discuss the actual error contribution in
more detail. A canonical factor could be incorrectly captured when the corresponding factor scope is
not included in the given structure. Canonical factors are designed so that a factor over a set of
variables captures only the residual interactions between the variables in its scope, once all
interactions between its subsets have been accounted for in other factors. Thus, canonical factors
over large scopes are often close to the trivial all-ones factor in practice. Therefore, if our
structure approximation is such that it only ignores some of the larger-scope factors, the error in
the approximation may be quite limited. A canonical factor could also be incorrectly captured when
the given structure does not have the correct Markov blanket for that factor. The resulting error
depends on how good an approximation of the Markov blanket we do have. See Section 4 for more
details on the error caused by incorrect Markov blankets.
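
To get a feel for how quickly this worst-case constant grows, the following minimal sketch evaluates $l 2^{l+1} \log \frac{1}{\gamma}$ for a few scope sizes. The function name is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not fixed in this section.

```python
import math


def worst_case_factor_penalty(l: int, gamma: float) -> float:
    """Worst-case bound l * 2**(l+1) * log(1/gamma) on the KL contribution of
    one incorrectly captured canonical factor over l variables.

    Assumption: natural logarithm.
    """
    if l < 1 or not (0.0 < gamma <= 1.0):
        raise ValueError("need l >= 1 and 0 < gamma <= 1")
    return l * 2 ** (l + 1) * math.log(1.0 / gamma)


# The bound grows exponentially with the scope size l.
for l in (1, 2, 3, 4):
    print(l, worst_case_factor_penalty(l, gamma=0.1))
```

The exponential growth in $l$ is why the discussion turns to the actual (typically much smaller) error contribution rather than relying on this constant.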


3.5 Reducing the Dependence on Network Size 


Our previous analysis showed that the bound on the KL-divergence grows linearly with the number of
factors $J$ in the network (for parameter learning). In a sense, this dependence is inevitable. To
understand why, consider a distribution $P$ defined by a set of $n$ independent Bernoulli random
variables $X_1, \ldots, X_n$, each with parameter 0.5. Assume that $Q$ is an approximation to $P$,
where the $X_i$ are still independent, but have parameter 0.4999. Intuitively, a Bernoulli(0.4999)
distribution is a very good





8. Each factor over $l$ variables is a fraction of a product of $2^{l-1}$ conditional probabilities
over another product of $2^{l-1}$ conditional probabilities. Recall that
$\gamma = \min_{x,i} P(X_i = x_i | X_{-i} = x_{-i}) > 0$, so we have that each conditional
probability over $l$ variables lies in the interval $[\gamma^l, 1]$. Thus we have for a factor over
$l$ variables that
$\max_{d_j^*} |\log f^*_{D_j^*|MB(D_j^*)}(d_j^*)| \le 2^{l-1} \log \frac{1}{\gamma^l} = l 2^{l-1} \log \frac{1}{\gamma}$.
Similarly, we have that
$\max_{c_j^*} \left| \log \frac{f^*_{C_j^*|MB(C_j^*)}(c_j^*)}{f^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)} \right| \le l 2^{l} \log \frac{1}{\gamma}$.

estimate of a Bernoulli(0.5); thus, for most applications, $Q$ can safely be considered to be a very
good estimate of $P$. However, the KL-divergence
$D(P(X_{1:n}) \| Q(X_{1:n})) = \sum_{i=1}^n D(P(X_i) \| Q(X_i)) = \Omega(n)$. Thus, if $n$ is large,
the KL divergence between $P$ and $Q$ would be large, even though $Q$ is a good estimate for $P$. To
remove such unintuitive scaling effects when studying the dependence on the number of variables, we
can consider instead the normalized KL divergence criterion:


\[ D_n(P(X_{1:n}) \| Q(X_{1:n})) = \frac{1}{n} D(P(X_{1:n}) \| Q(X_{1:n})). \]
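
The Bernoulli example can be checked numerically. The sketch below (function names are illustrative, natural logarithm assumed) computes the KL-divergence between two product distributions of $n$ independent Bernoulli variables and its normalized counterpart; the first grows linearly in $n$ while the second does not depend on $n$.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL-divergence D(Bernoulli(p) || Bernoulli(q)), natural log, 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def product_kl(n: int, p: float, q: float) -> float:
    """KL-divergence between products of n i.i.d. Bernoulli variables.

    For independent variables the KL-divergence is the sum of the per-variable terms.
    """
    return n * bernoulli_kl(p, q)


def normalized_kl(n: int, p: float, q: float) -> float:
    """KL-divergence normalized by the number of variables n."""
    return product_kl(n, p, q) / n


for n in (10, 10**4, 10**8):
    print(n, product_kl(n, 0.5, 0.4999), normalized_kl(n, 0.5, 0.4999))
```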


As we now show, with a slight modification to the algorithm, we can achieve a bound of $\epsilon$
for our normalized KL-divergence while eliminating the logarithmic dependence on $J$ in our sample
complexity bound. Specifically, we can modify our algorithm so that it clips probability estimates
$\in [0, \gamma^{k+b})$ to $\gamma^{k+b}$. The clipping procedure is motivated by the proof of
Theorem 8 and effectively ensures that the KL-divergence is bounded.⁹ Note that, since the true
probabilities which we are trying to estimate are never in the interval $[0, \gamma^{k+b})$, this
change can only improve the estimates.¹⁰
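
As an illustration of clipping for a single discrete distribution, here is a minimal sketch. It is not the paper's implementation: the function name and the `floor` parameter are hypothetical, and the renormalization (subtracting the excess mass from the largest entry) follows the simple fix described for non-binary variables in the footnotes. It assumes the input sums to one and that `floor` is small relative to the largest entry.

```python
import math
from typing import List, Sequence


def clip_and_renormalize(probs: Sequence[float], floor: float) -> List[float]:
    """Raise every estimate below `floor` up to `floor`, then remove the excess
    probability mass from the largest entry so the result sums to one again.

    Assumes `probs` sums to one and `floor` is small enough that the largest
    entry stays above `floor` after the subtraction.
    """
    if floor <= 0.0 or floor * len(probs) >= 1.0:
        raise ValueError("floor must satisfy 0 < floor < 1 / len(probs)")
    clipped = [max(p, floor) for p in probs]
    excess = sum(clipped) - 1.0
    top = max(range(len(clipped)), key=lambda i: clipped[i])
    clipped[top] -= excess
    return clipped


# Empirical estimate with two unseen values.
print(clip_and_renormalize([0.0, 0.0, 0.25, 0.75], floor=0.01))
```

Without the clipping step, a zero estimate for an event with positive true probability would make the KL-divergence infinite.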

For this slightly modified version of the algorithm, the following theorem shows the dependence
on the size of the network is O(1), which is tighter than the logarithmic dependence shown in
Theorem 6.¹¹


Theorem 8 (Parameter learning: size of the network) Let any $\epsilon, \delta > 0$ be given and
fixed. Let $\{x^{(i)}\}_{i=1}^m$ be i.i.d. samples from $P$. Let the domain size of each variable be
fixed. Let the degree of both the factor and variable nodes be bounded by a fixed constant. Let
$\gamma = \min_{x,i} P(X_i = x_i | X_{-i} = x_{-i})$ be fixed. Let $\tilde{P}$ be the probability
distribution returned by Factor-Graph-Parameter-Learn. Then, for

\[ D_n(P \| \tilde{P}) + D_n(\tilde{P} \| P) \le \epsilon \]

to hold with probability at least $1 - \delta$, it suffices that we have a certain number of
training examples that does not depend on the number of variables in the network.


The following theorem shows a similar result for Bayesian networks, namely that for a fixed
bound on the number of parents per node, the sample complexity dependence on the size of the
network is O(1).¹²





9. In particular, we first show that the error contribution from any fixed factor is small with
high probability. Then, rather than using a union bound to ensure the error contributions from all
factors are small (which would result in a logarithmic dependence of the sample complexity on the
number of factors or variables), we use Markov’s inequality to show that the error contribution of
almost all factors is small with high probability. This leaves us to bound the error contribution of
the (few) remaining factors, for which the error contribution is not small. By clipping the
probability estimates, we can ensure their error contribution is bounded. A very similar reasoning
applies to the case of Theorem 9. (See the proofs of Theorems 8 and 9, given in the appendix, for
more details.)

10. This solution assumes that y is known. If not, we can use a clipping threshold as a function of the number of training 
examples. Such an adaptive clipping procedure was used by Dasgupta (1997) to derive sample complexity bounds 
for learning fixed structure Bayesian networks. 

11. We note that Theorem 8 assumes the maximum number of factors a variable can participate in is
fixed (i.e., it cannot grow with the number of variables in the network). As a consequence, the
dependence on the number of factors $J$ and the dependence on the number of variables $n$ are
equivalent (up to a constant factor).

12. Complete proofs for Theorems 8 and 9 (and all other results in this paper) are given in the appendix of this paper. 
In the appendix we actually give a much stronger version of Theorem 9, including dependencies of m on €,6,k and 
a graceful degradation result. We note that for non-binary random variables the clipping procedure is a bit more 
subtle than for binary random variables. In particular, to ensure that the resulting clipped probabilities sum to one, 
we might have to subtract a small quantity from the highest probability estimate after the clipping. For example, for 

Theorem 9 Let any $\epsilon > 0$ and $\delta > 0$ be given. Let any Bayesian network (BN) structure
over $n$ variables with at most $k$ parents per variable be given. Let $P$ be a probability
distribution that factors over the BN. Let $\tilde{P}$ denote the probability distribution obtained
by fitting the conditional probability table (CPT) entries via maximum likelihood and then clipping
each CPT entry to the


interval lamaie wl ™ z]. Then we have that for 





8]val 
\[ D_n(P \| \tilde{P}) \le \epsilon \]


to hold with probability at least $1 - \delta$, it suffices that we have a certain number of
training examples that does not depend on the number of variables in the network.


Theorems 8 and 9 provide theoretical support for the common practice of learning large graph- 
ical models from a relatively small number of training examples. More specifically, the number of 
training examples can be much smaller than the number of parameters when learning large graphical 
models. In contrast, for many problems in machine learning, the sample complexity grows roughly 
linearly or at most as some low-order polynomial in the number of parameters (Vapnik, 1998). 
The difference in sample complexity relates to the discussion of generative versus discriminative 
training. Indeed our result generalizes and even strengthens the results of Ng and Jordan (2002). 
They showed a logarithmic dependence on the number of variables for the very specific case of a 
graphical model with the naive Bayes structure. 


4. Structure Learning 


The algorithm described in the previous section uses the known network to establish a Markov 
blanket for each factor. This Markov blanket is then used to estimate the canonical parameters from 
empirical data. In this section, we show how we can build on this algorithm to perform structure 
learning, by first identifying (from the data) an approximate Markov blanket for each candidate 
factor, and then using this approximate Markov blanket to compute the parameters of that factor 
from a “small” number of training examples. 


4.1 Identifying Markov Blankets 


In the parameter learning results, the Markov blanket $MB(C_j^*)$ is used to efficiently estimate
the conditional probability $P(C_j^* | X - C_j^*)$, which is equal to $P(C_j^* | MB(C_j^*))$. This
suggests measuring the quality of a candidate Markov blanket $Y$ by how well $P(C_j^* | Y)$
approximates $P(C_j^* | X - C_j^*)$. In this section we show how conditional entropy can be used to
find a candidate Markov blanket that gives a good approximation for this conditional probability.¹³





$\epsilon$ sufficiently small, we have that naively clipping the probability estimates
(0, 0, 1/4, 3/4) to the interval $[\epsilon, 1 - \epsilon]$ results in
$(\epsilon, \epsilon, 1/4, 3/4)$, which does not sum to one (but rather to $1 + 2\epsilon$).
Subtracting the additional probability mass $2\epsilon$ from the highest entry fixes this problem.
For this example we get $(\epsilon, \epsilon, 1/4, 3/4 - 2\epsilon)$. In general, for $v$-valued
random variables, the probability estimates can be made to sum to one (after clipping) by
subtracting at most $(v - 1)\epsilon$ from the highest probability estimate. In the appendix we
expand more on the topic of clipping for non-binary random variables.

13. For some readers, some intuition might be gained from the fact that the conditional entropy of
$C_j^*$ given the candidate Markov blanket $Y$ corresponds to the log-loss of predicting $C_j^*$
given the candidate Markov blanket $Y$.

Definition 10 (Conditional Entropy) Let $P$ be a probability distribution over $X, Y$. Then the
conditional entropy $H(X|Y)$ of $X$ given $Y$ is defined as


\[ H(X|Y) = -\sum_{x \in val(X),\, y \in val(Y)} P(X = x, Y = y) \log P(X = x | Y = y). \]
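
The definition translates directly into code. The following is a minimal sketch that computes $H(X|Y)$ from a joint probability table; the dictionary representation and the function name are illustrative choices, and the natural logarithm is assumed.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def conditional_entropy(joint: Dict[Tuple[Hashable, Hashable], float]) -> float:
    """H(X|Y) = -sum_{x,y} P(x, y) log P(x | y) for a table {(x, y): P(x, y)}."""
    marginal_y = defaultdict(float)
    for (_, y), p in joint.items():
        marginal_y[y] += p
    h = 0.0
    for (_, y), p in joint.items():
        if p > 0.0:  # terms with P(x, y) = 0 contribute nothing
            h -= p * math.log(p / marginal_y[y])
    return h


# Two independent fair bits: knowing Y tells us nothing about X.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# X is a copy of Y: knowing Y determines X.
copy = {(0, 0): 0.5, (1, 1): 0.5}
print(conditional_entropy(independent), conditional_entropy(copy))
```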


Proposition 11 (Cover & Thomas, 1991) Let $P$ be a probability distribution over $X, Y, Z$. Then
we have $H(X|Y,Z) \le H(X|Y)$.


Proposition 11 shows that conditional entropy can be used to find the Markov blanket for a given
set of variables. Namely, let $D, Y \subseteq X$, $D \cap Y = \emptyset$, then we have


\[ H(D | MB(D)) = H(D | X - D) \le H(D | Y), \tag{12} \]


where the equality follows from the Markov blanket property stated in Eqn. (1) and the inequality 
follows from Proposition 11. Thus, we can select the set of variables Y that minimizes H(D|Y) as 
our candidate Markov blanket for the set of variables D. 
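
Eqn. (12) can be illustrated on a toy example. The sketch below builds the joint distribution of a three-variable binary Markov chain $X_1 \to X_2 \to X_3$ (a hypothetical example, not taken from the paper), in which the Markov blanket of $X_1$ is $\{X_2\}$, and compares the conditional entropies of $X_1$ given different conditioning sets.

```python
import itertools
import math
from collections import defaultdict


def chain_joint(flip: float = 0.2) -> dict:
    """Joint of a binary chain X1 -> X2 -> X3; each step flips with prob. `flip`."""
    joint = {}
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        p = 0.5
        p *= flip if x2 != x1 else 1 - flip
        p *= flip if x3 != x2 else 1 - flip
        joint[(x1, x2, x3)] = p
    return joint


def cond_entropy(joint: dict, target: tuple, given: tuple) -> float:
    """H(target | given); target and given are tuples of coordinate indices."""
    p_tg, p_g = defaultdict(float), defaultdict(float)
    for x, p in joint.items():
        t = tuple(x[i] for i in target)
        g = tuple(x[i] for i in given)
        p_tg[(t, g)] += p
        p_g[g] += p
    return -sum(p * math.log(p / p_g[g]) for (_, g), p in p_tg.items() if p > 0)


joint = chain_joint()
print("H(X1 | X2)     =", cond_entropy(joint, (0,), (1,)))      # Markov blanket
print("H(X1 | X2, X3) =", cond_entropy(joint, (0,), (1, 2)))    # all other variables
print("H(X1 | X3)     =", cond_entropy(joint, (0,), (2,)))      # a worse candidate
```

Conditioning on the Markov blanket alone matches conditioning on all other variables, while a candidate set that misses the blanket gives a strictly larger conditional entropy in this example.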

Our first difficulty is that, when learning from data, we do not have the true distribution, and 
hence the exact conditional entropies are unknown. The following lemma shows that the conditional 
entropy can be efficiently estimated from samples. 


Lemma 12 Let $P$ be a probability distribution over $X, Y$ such that for all instantiations $x, y$
we have $P(X = x, Y = y) \ge \lambda$. Let $\hat{H}$ be the conditional entropy computed based upon
$m$ i.i.d. samples from $P$. Then for

\[ |H(X|Y) - \hat{H}(X|Y)| \le \epsilon \]

to hold with probability $1 - \delta$, it suffices that:

\[ m \ge \frac{8 |val(X)|^2 |val(Y)|^2}{\lambda^2 \epsilon^2} \log \frac{4 |val(X)| |val(Y)|}{\delta}. \]
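
One natural way to compute a conditional entropy from samples is the plug-in estimate, which replaces the true probabilities by empirical frequencies. The following sketch assumes this estimator (the section does not spell out the computation); the function name is illustrative and the natural logarithm is assumed.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def empirical_conditional_entropy(samples: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """Plug-in estimate of H(X|Y) from a list of observed (x, y) pairs."""
    m = len(samples)
    n_xy = Counter(samples)                 # counts of each (x, y) pair
    n_y = Counter(y for _, y in samples)    # counts of each y
    return -sum((c / m) * math.log(c / n_y[y]) for (_, y), c in n_xy.items())


# X copies Y in every sample, so the estimate is zero.
print(empirical_conditional_entropy([(0, 0), (1, 1)] * 50))
# X and Y look independent and uniform, so the estimate is log 2.
print(empirical_conditional_entropy([(0, 0), (0, 1), (1, 0), (1, 1)] * 25))
```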





However, as the empirical estimates of the conditional entropy are noisy, the true Markov blanket
is not guaranteed to achieve the minimum of $\hat{H}(D|Y)$. In fact, in some probability
distributions, many sets of variables could be arbitrarily close to reaching equality in Eqn. (12).
Thus, in many cases, our procedure will not recover the actual Markov blanket, when given only a
finite number of training examples. Fortunately, as we show in the next lemma, any set of variables
$U \cup W$ that is close to achieving equality in Eqn. (12) gives an accurate approximation
$P(C_j^* | U, W)$ of the probabilities $P(C_j^* | X - C_j^*)$ used in the canonical parameterization.


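Lemma 13 Let any $\epsilon > 0$ be given. Let $P$ be a distribution over disjoint sets of random
variables $U, V, W, X, Y$. Let $\mu = \min_{u \in val(U), v \in val(V), w \in val(W)} P(u, v, w)$,
and let $\lambda = \min_{x \in val(X), u \in val(U), v \in val(V), w \in val(W)} P(x | u, v, w)$.
Assume the following holds:

\[ X \perp Y, W \mid U, V, \tag{13} \]
\[ H(X | U, W) \le H(X | U, V, W, Y) + \epsilon. \tag{14} \]

Then we have that $\forall x, y, u, v, w$:

\[ |\log P(x | u, v, w, y) - \log P(x | u, w)| \le \frac{\sqrt{2\epsilon}}{\lambda \sqrt{\mu}}. \tag{15} \]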

In other words, if a set of variables $U \cup W$ looks like a Markov blanket for $X$, as evaluated
by the conditional entropy $H(X | U, W)$, then the conditional distribution $P(X | U, W)$ must be
close to the conditional distribution $P(X | \mathcal{X} - X)$ of $X$ given all other variables.
Thus, it suffices to find such an approximate Markov blanket $U \cup W$ as a substitute for knowing
the true Markov blanket $U \cup V$. This makes conditional entropy suitable for structure learning.


4.2 Structure Learning Algorithm 


We propose the following Factor-Graph-Structure-Learn algorithm. The algorithm receives as input:
training examples $\{x^{(i)}\}_{i=1}^m$; $k$: the maximum number of variables per factor; $b$: the
maximum number of variables per Markov blanket for any set of variables up to size $k$; $\bar{x}$: a
base instantiation.¹⁴

Let $\mathcal{C}$ be the set of candidate factor scopes, and let $\mathcal{Y}$ be the set of
candidate Markov blankets. I.e., we have


\[ \mathcal{C} = \{C_j^* : C_j^* \subseteq X,\ C_j^* \ne \emptyset,\ |C_j^*| \le k\}, \tag{16} \]
\[ \mathcal{Y} = \{Y : Y \subseteq X,\ |Y| \le b\}. \tag{17} \]


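The algorithm does the following:


• $\forall C_j^* \in \mathcal{C}$, find
  $\widehat{MB}(C_j^*) = \arg\min_{Y \in \mathcal{Y},\, Y \cap C_j^* = \emptyset} \hat{H}(C_j^* | Y)$,
  which is the best candidate Markov blanket.


• $\forall C_j^* \in \mathcal{C}$, compute the estimates
  $\{\hat{f}^*_{C_j^*|\widehat{MB}(C_j^*)}\}_j$ of the canonical factors as defined in Eqn. (7)
  using the empirical distribution.


• Threshold to one the factor entries $\hat{f}^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)$ satisfying
  $|\log \hat{f}^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)| \le \frac{\epsilon}{2^{k+2}}$, and discard
  the factors that have all entries equal to one.


• Return the probability distribution
  $\tilde{P}(x) \propto \prod_j \hat{f}^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)$.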
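
The first step of the algorithm, the search for the best candidate Markov blanket of one scope, can be sketched as a brute-force enumeration. This is only an illustration: the function names and the data layout (rows of discrete values indexed by variable position) are assumptions, the plug-in entropy estimate is assumed, and ties are broken in favor of the candidate enumerated first (smaller sets), which the algorithm description does not specify.

```python
import itertools
import math
from collections import Counter
from typing import Sequence, Tuple


def empirical_cond_entropy(data: Sequence[Sequence[int]],
                           target: Tuple[int, ...],
                           given: Tuple[int, ...]) -> float:
    """Plug-in estimate of H(target | given) from rows of discrete values."""
    m = len(data)
    joint = Counter((tuple(r[i] for i in target), tuple(r[i] for i in given)) for r in data)
    cond = Counter(tuple(r[i] for i in given) for r in data)
    return -sum((c / m) * math.log(c / cond[g]) for (_, g), c in joint.items())


def best_markov_blanket(data, n_vars: int, scope: Tuple[int, ...], b: int):
    """Return (blanket, entropy) minimizing the empirical H(scope | Y) over all
    candidate sets Y disjoint from `scope` with |Y| <= b."""
    others = [i for i in range(n_vars) if i not in scope]
    best, best_h = (), math.inf
    for size in range(b + 1):
        for cand in itertools.combinations(others, size):
            h = empirical_cond_entropy(data, scope, cand)
            if h < best_h - 1e-12:  # assumption: ties go to the earlier candidate
                best, best_h = cand, h
    return best, best_h


# Toy data: variable 1 is a copy of variable 0, variable 2 is independent.
data = [(x0, x0, x2) for x0 in (0, 1) for x2 in (0, 1)] * 25
print(best_markov_blanket(data, n_vars=3, scope=(0,), b=2))
```

Running this search for every candidate scope is what produces the exponential dependence on $k$ and $b$ in the running time.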

The thresholding step finds the factors that actually contribute to the distribution. The specific
threshold is chosen to suit the proof of Theorem 15. If no thresholding were applied, the error
in Eqn. (18) would be $|\mathcal{C}|\epsilon$ instead of $J\epsilon$ (where $\mathcal{C}$ is the set
of candidate factor scopes), which is much larger in case the true distribution has a
relatively small number of factors $J$.
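
A minimal sketch of the thresholding step follows. The table representation (a dictionary from scope to a dictionary of positive entries) and the function name are hypothetical, and the threshold is left as a parameter rather than tied to the specific value used in the analysis.

```python
import math
from typing import Dict, Tuple

FactorTable = Dict[Tuple[int, ...], float]


def threshold_factors(factors: Dict[Tuple[str, ...], FactorTable],
                      threshold: float) -> Dict[Tuple[str, ...], FactorTable]:
    """Set entries with |log value| <= threshold to one, then drop every factor
    whose entries are all equal to one. Entries are assumed to be positive."""
    kept = {}
    for scope, table in factors.items():
        new_table = {a: (1.0 if abs(math.log(v)) <= threshold else v)
                     for a, v in table.items()}
        if any(v != 1.0 for v in new_table.values()):
            kept[scope] = new_table
    return kept


estimated = {
    ("A",): {(0,): 1.001, (1,): 0.999},                       # numerically trivial
    ("A", "B"): {(0, 0): 2.0, (0, 1): 1.0005, (1, 0): 0.5, (1, 1): 1.0},
}
print(threshold_factors(estimated, threshold=0.01))
```

In this toy input the single-variable factor is discarded entirely, so only factors that carry real interaction remain in the returned model.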


Theorem 14 (Structure learning: computational complexity) The running time¹⁵ of
Factor-Graph-Structure-Learn is in
$O(m k n^k b n^b (k + b) + k n^k b n^b v^{k+b} + k n^k 2^k v^k)$.


Thus the running time is exponential in the maximum factor scope size k and the maximum Markov 
blanket size b, polynomial in the number of variables n and the maximum domain size v, and linear 
in the number of training examples m. 
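
The polynomial dependence on $n$ comes from the number of candidate scopes and candidate Markov blankets the algorithm enumerates. The sketch below counts them exactly from the definitions of the two candidate sets (non-empty scopes of size at most $k$, blankets of size at most $b$ including the empty set); the function names are illustrative.

```python
from math import comb


def num_candidate_scopes(n: int, k: int) -> int:
    """Number of non-empty subsets of n variables with at most k elements."""
    return sum(comb(n, i) for i in range(1, k + 1))


def num_candidate_blankets(n: int, b: int) -> int:
    """Number of subsets of n variables with at most b elements (empty set included)."""
    return sum(comb(n, i) for i in range(0, b + 1))


for n in (10, 100, 1000):
    print(n, num_candidate_scopes(n, k=2), num_candidate_blankets(n, b=4))
```

Both counts are bounded by polynomials of degree $k$ and $b$ in $n$, which matches the $n^k$ and $n^b$ factors in the running time.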

The first two terms in Theorem 14 result from going through the data and computing the em- 
pirical conditional entropies. Since the algorithm considers all combinations of candidate factors 
and Markov blankets, we have an exponential dependence on the maximum scope size k and the 





14. Note in the parameter learning setting we had b equal to the size of the largest Markov blanket
for an actual factor in the distribution. In contrast, now b corresponds to the size of the largest
Markov blanket for any candidate factor up to size k.

15. The upper bound is based on a very naive implementation’s running time. 

maximum Markov blanket size b. The last term comes from computing the Markov blanket canonical
factors. Importantly, unlike for currently-known (exact) ML approaches, the running time does
not depend on the tractability of inference in the (unknown) factor graph from which the data was 
sampled, nor on the tractability of inference in the recovered factor graph. 


Theorem 15 (Structure learning: sample complexity) Let any $\epsilon, \delta > 0$ be given. Let
Factor-Graph-Structure-Learn be given (a) $m$ training examples $\{x^{(i)}\}_{i=1}^m$ drawn i.i.d.
from a distribution $P$, (b) an upper bound $k$ on the number of variables per factor in the factor
graph for $P$, and (c) an upper bound $b$ on the number of variables per Markov blanket for any set
of variables up to size $k$ in the factor graph for $P$. Let $\tilde{P}$ be the distribution
returned by Factor-Graph-Structure-Learn. Then for


\[ D(P \| \tilde{P}) + D(\tilde{P} \| P) \le J\epsilon \tag{18} \]


to hold with probability $1 - \delta$, it suffices that the number of training examples $m$
satisfies:


\[ m \ge \left(1 + \frac{\epsilon \gamma^{k+b}}{2^{2k+3}}\right)^2 \frac{v^{2k+2b}\, 2^{8k+19}}{\gamma^{6k+6b} \min\{\epsilon^2, \epsilon^4\}} \log \frac{8 k b n^{k+b} v^{k+b}}{\delta}. \tag{19} \]


Proof (sketch). From Lemmas 12 and 13 we have that the conditioning set chosen by Factor-Graph-
Structure-Learn results in a good approximation of the true canonical factor. At this point the
structure is fixed, and we can use the sample complexity theorem for parameter learning to finish
the proof. □


Theorem 15 shows that the sample complexity depends exponentially on the maximum factor size
$k$ and the maximum Markov blanket size $b$; and polynomially on $1/\gamma$ and $1/\epsilon$. If we
modify the analysis to consider the normalized KL-divergence, as in Section 3.5, we obtain a
logarithmic dependence on the number of variables in the network.

To understand the implications of this theorem, consider the class of Gibbs distributions where
every variable can participate in at most $d$ factors and every factor can have at most $k$
variables in its scope. Then we have that the Markov blanket size $b \le dk^2$. Bayesian network
probability distributions can also be represented using factor graphs.¹⁶ If the number of parents
per variable is bounded by numP and the number of children per variable is bounded by numC, then we
have $k \le$ numP + 1, and $b \le$ (numC + 1)(numP + 1)$^2$. Thus our factor graph structure
learning algorithm allows us to efficiently learn distributions that can be represented by Bayesian
networks with a bounded number of children and parents per variable. Note that our algorithm
recovers a distribution which is close to the true generating distribution, but the distribution it
returns is encoded as a factor graph, which may not be representable as a compact Bayesian network.
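
The conversion from a Bayesian network to a factor graph (one factor per variable, with scope equal to the variable and its parents) and the resulting bounds on $k$ and $b$ can be sketched as follows. The function names and the dictionary-of-parents representation are illustrative assumptions; every variable, including roots, is expected to appear as a key.

```python
from typing import Dict, List, Set, Tuple


def bn_to_factor_scopes(parents: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """One factor per variable; its scope is the variable together with its parents."""
    return {child: {child, *pa} for child, pa in parents.items()}


def degree_bounds(parents: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return the bounds (numP + 1, (numC + 1) * (numP + 1)**2) on k and b."""
    num_p = max(len(pa) for pa in parents.values())
    n_children = {v: 0 for v in parents}
    for pa in parents.values():
        for p in pa:
            n_children[p] += 1
    num_c = max(n_children.values())
    return num_p + 1, (num_c + 1) * (num_p + 1) ** 2


# A small network: A -> B, A -> C, B -> C.
parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
print(bn_to_factor_scopes(parents))
print(degree_bounds(parents))
```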

Theorem 15 considers the case where the generating distribution P factors according to a struc- 
ture with factor scope sizes bounded by k and size of Markov blankets (of any subset of variables of 
size less than k) bounded by b. As we did in the case of parameter estimation, we can show that we 
have graceful degradation of performance for distributions that do not satisfy these assumptions. 





16. Given a Bayesian network (BN), the following factor graph represents the same distribution: The factor graph has 
one variable node per variable in the BN. The factor graph has one factor for each variable in the BN. Each factor’s 
scope is equal to the union of the corresponding variable itself and its parents. Each factor’s entries are equal to the 
corresponding conditional probability table entries of the BN. 

Theorem 16 (Structure learning: graceful degradation) Let any $\epsilon, \delta > 0$ be given. Let
$\{x^{(i)}\}_{i=1}^m$ be training examples drawn i.i.d. from a distribution $Q$. Let MB and
$\widehat{MB}$ be the Markov blankets according to the distribution $Q$ and found by
Factor-Graph-Structure-Learn respectively. Let $\{f^*_{D_j^*|MB(D_j^*)}\}_j$ be the non-trivial
Markov blanket canonical factors of $Q$ (those factors with not all entries equal to one). Let $J$
be the number of non-trivial Markov blanket canonical factors in $Q$ with scope size smaller than
$k$. Let $\tilde{P}$ be the probability distribution returned by Factor-Graph-Structure-Learn.
Then, for


\[ D(Q \| \tilde{P}) + D(\tilde{P} \| Q) \le (J + |S|)\epsilon
   + 2 \sum_{j : |D_j^*| > k} \max_{d_j^*} \left| \log f^*_{D_j^*|MB(D_j^*)}(d_j^*) \right|
   + 2 \sum_{C_j^* \in \mathcal{C} : |MB(C_j^*)| > b} \max_{c_j^*} \left| \log \frac{f^*_{C_j^*|MB(C_j^*)}(c_j^*)}{f^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)} \right| \]


to hold with probability at least $1 - \delta$, it suffices that the number of training examples $m$
satisfies Eqn. (19) of Theorem 15. Here $S = \{j : C_j^* \notin \{D_l^*\}_l,\ |MB(C_j^*)| > b\}$ is
the set that indexes over the subsets of variables of size smaller than $k$ over which there is no
factor in the true distribution and for which the Markov blanket in the true distribution is larger
than $b$; $\mathcal{C}$ is the set of candidate factor scopes
$\mathcal{C} = \{C_j^* : C_j^* \subseteq X,\ C_j^* \ne \emptyset,\ |C_j^*| \le k\}$.


Theorem 16 shows that (similar to the parameter learning setting) each canonical factor that
is not captured by our learned structure contributes at most a constant to our bound on the
KL-divergence (namely $l 2^{l+1} \log \frac{1}{\gamma}$ for a factor over $l$ variables; see
footnote 8 for details). This bound on the error contribution can be large, so we discuss the actual
error contribution in more detail. There are two reasons a canonical factor might not be captured.
First, the scope of the factor could be too large. The paragraph after Theorem 7 discusses when
the resulting error is expected to be small. Second, the Markov blanket of the factor could be too
large. As shown in Lemma 13, a good approximate Markov blanket is sufficient to get a good
approximation. So we can expect these error contributions to be small if the true distribution is
mostly determined by interactions between small sets of variables.

Recall that the structure learning algorithm correctly clips all estimates of trivial canonical fac- 
tors to the trivial all-ones factor, when the structural assumptions are satisfied. I.e., trivial factors 
are correctly estimated as trivial if their Markov blanket is of size smaller than b. The additional 
term $|S|\epsilon$ corresponds to estimation error on the factors that are trivial in the true distribution but
that have a Markov blanket of size larger than b, and are thus not correctly estimated and clipped to 
trivial all-ones factors. 


5. Related Work 


Tables 2 and 3 summarize the prior work on Markov network and Bayesian network learning that 
comes with formal guarantees. In the following two sections we discuss the prior work on Markov 
network (factor graph) learning and Bayesian network learning in more detail. We also discuss 
algorithms that do not have formal guarantees. 

Target distribution     True distribution   Structure/Parameter   Samples    Time      Graceful degradation   Reference
ML tree                 any                 structure             poly       poly      yes                    [1]
ML bounded tree-width   any                 structure             poly       NP-hard   yes                    [2]
Bounded tree-width      same                structure             poly       poly      no                     [3]
Factor graph            same                parameter             infinite   convex    no                     [4], [5]
Factor graph            same                parameter             poly       poly      yes                    [6]
Factor graph            same                structure             poly       poly      yes                    [6]

Table 2: Overview of prior work on learning Markov networks that has formal guarantees. More
         details are given in Section 5.1. The references in the table are: [1]: Chow and Liu (1968);
         [2]: Srebro (2001); [3]: Narasimhan and Bilmes (2004); [4]: Besag (1974b); [5]: Gidas
         (1988); [6]: this paper. “Convex” refers to the time of solving a convex optimization
         problem.


5.1 Markov Networks 


We split the discussion into two parts: parameter learning and structure learning. 


5.1.1 PARAMETER LEARNING 


The most natural algorithm for parameter estimation in undirected graphical models is maximum 
likelihood (ML) estimation (possibly with some regularization). Unfortunately, evaluating the like- 
lihood of such a model requires evaluating the partition function. All currently known ML algo- 
rithms for undirected graphical models require evaluating the partition function. Therefore, they 
are computationally tractable only for networks in which inference is computationally tractable. 
In contrast, our closed form solution can be efficiently computed from the data, even for Markov 
networks where inference is intractable. Note that our estimator does not return the ML solution, 
so that our result does not contradict the “hardness” of ML estimation. However, it does provide 
a low KL-divergence estimate of the probability distribution, with high probability, from a “small” 
number of training examples, assuming the true distribution approximately factors according to the 
given structure. 

Criteria different from ML have been proposed for learning Markov networks. The most prominent
one is pseudo-likelihood (Besag, 1974b), and its extension, generalized pseudo-likelihood
(Huang and Ogata, 2002). The pseudo-likelihood criterion gives rise to a tractable convex
optimization problem. Pseudo-likelihood estimation is consistent, that is, in the infinite sample
limit it returns the true distribution, when the assumed structure is correct. (See, for example,
Gidas, 1988.) However, in the finite sample case the pseudo-likelihood estimate is often
significantly worse than the maximum likelihood estimate. More information on the statistical
efficiency of the pseudo-likelihood estimate can be found in, for example, Besag (1974a); Geyer and
Thompson (1992); Guyon and Künsch (1992). In contrast to our results, no finite sample bounds have
been provided for pseudo-likelihood estimation. Moreover, the theoretical analyses (e.g., Geman and
Graffigne, 1986; Comets, 1992; Guyon and Künsch, 1992) only apply when the generating model is in
the true target class.


5.1.2 STRUCTURE LEARNING 


Structure learning for Markov networks is notoriously difficult, as it is generally based on using ML
estimation of the parameters (with smoothing), often combined with a penalty term for structure
complexity. As evaluating the likelihood is only possible for the class of Markov networks in which
inference is tractable, there have been two main research tracks for ML structure learning. The first,
starting with the work of Della Pietra et al. (1997), uses local-search heuristics to add factors into
the network (see also McCallum, 2003). The second searches for a structure within a restricted
class of models in which inference is tractable, more specifically, bounded tree-width Markov
networks. Indeed, ML learning of the class of tree Markov networks (networks of tree-width 1) can
be performed very efficiently (Chow and Liu, 1968). Unfortunately, Srebro (2001) proves that for
any tree-width k greater than 1, even finding the ML tree-width-k network is NP-hard. Karger and
Srebro (2001) provide an approximation algorithm but the approximation factor is a very large
multiplicative factor of the log-likelihood. In particular, for tree-width k, they find a Markov
network (of tree-width k) with log-likelihood at least $1/(8^k k!(k+1)!)$ times the optimal
log-likelihood. Several heuristic algorithms to learn models with small tree-width have been
proposed (Malvestuto, 1991; Bach and Jordan, 2002; Deshpande et al., 2001), but (not surprisingly,
given the NP-hardness of the problem) they do not come with any performance guarantees.


Target distribution   True distribution   Structure/Parameter   Samples    Time      Graceful degradation   Reference
ML polytree           any                 structure             poly       NP-hard   yes                    [1], [2]
ML BN                 any                 structure             poly       NP-hard   yes                    [1], [3]
BN                    same                structure             infinite   poly      yes                    [4], [5]
Factor graph          BN (same)           structure             poly       poly      yes                    [6]

Table 3: Overview of prior work on learning Bayesian networks that has formal guarantees. More
         details are given in Section 5.2. The references in the table are: [1]: Höffgen (1993); [2]:
         Dasgupta (1999); [3]: Chickering et al. (2003); [4]: Spirtes et al. (2000); [5]: Cheng et al.
         (2002); [6]: this paper.

Recently, Narasimhan and Bilmes (2004) provided a polynomial time algorithm with a polyno- 
mial sample complexity guarantee for the class of Markov networks of bounded tree-width. Their 
algorithm computes approximate conditional independence information followed by dynamic pro- 
gramming to recover the bounded tree-width structure. The parameters for the recovered bounded 
tree-width model are estimated by standard ML methods. Our algorithm applies to a different fam- 
ily of distributions: factor graphs of bounded connectivity (including graphs in which inference is 
intractable). Factor graphs with small connectivity can have large tree-width (e.g., grids) and fac- 
tor graphs with small tree-width can have large connectivity (e.g., star graphs). Thus, the range of 
applicability is incomparable. Narasimhan and Bilmes (2004) did not provide any graceful degra- 
dation guarantees when the generating distribution is not a member of the target class. However, 
future research might extend their algorithm to this setting. 

Pseudo-likelihood has been extended to a criterion for model selection: the resulting criterion
is statistically consistent (Ji and Seymour, 1996). In particular they show that the probability of
selecting an incorrect model goes to zero as the number of training examples goes to infinity. They 
also provide a bound on how fast this probability goes to zero. Importantly, Ji and Seymour (1996) 
only provide a model selection criterion. They do not provide an algorithm to efficiently find the 
best pseudo-likelihood model (according to their evaluation criterion) over the super-exponentially 
large set of candidate models from which we want to select in the structure learning problem. 


5.2 Bayesian Networks 


Again, we split the discussion into two parts: parameter learning and structure learning. 




5.2.1 PARAMETER LEARNING 


ML parameter learning in Bayesian networks (possibly with smoothing) only requires computing 
the empirical conditional probabilities of each variable given its parent instantiations. Thus there is 
no computational challenge. 
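
As a concrete illustration of why this is computationally easy, the following sketch fits one conditional probability table by counting. The data layout (a list of dictionaries from variable name to value), the function name, and the additive smoothing parameter `alpha` are illustrative assumptions; only parent instantiations that occur in the data receive a row.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def fit_cpt(data: Sequence[Dict[str, Hashable]],
            child: str,
            parents: List[str],
            child_values: Sequence[Hashable],
            alpha: float = 0.0) -> Dict[Tuple, Dict[Hashable, float]]:
    """Empirical P(child | parents), with optional additive smoothing `alpha`.

    With alpha = 0 this is the maximum likelihood estimate.
    """
    joint, parent_counts = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        parent_counts[pa] += 1
    cpt = {}
    for pa, n_pa in parent_counts.items():
        denom = n_pa + alpha * len(child_values)
        cpt[pa] = {v: (joint[(pa, v)] + alpha) / denom for v in child_values}
    return cpt


data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
print(fit_cpt(data, child="B", parents=["A"], child_values=[0, 1]))
print(fit_cpt(data, child="B", parents=["A"], child_values=[0, 1], alpha=1.0))
```

A single pass over the data per variable suffices, in contrast to the undirected case where the partition function couples all parameters.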


Dasgupta (1997), following earlier work by Friedman and Yakhini (1996), analyzes the sample 
complexity of learning Bayesian networks, showing that it is polynomial in the maximal number 
of different instantiations per family. His sample complexity result has logarithmic dependence on 
the number of variables in the network, when using the KL-divergence normalized by the number 
of variables in the network. In this paper, we strengthen his result, showing an O(1) dependence 
of the number of training examples on the number of variables in the network. So for bounded 
fan-in Bayesian networks, the sample complexity is independent of the number of variables in the 
network. 


5.2.2 STRUCTURE LEARNING 


Results analyzing the complexity of structure learning of Bayesian networks fall largely into two 
classes. The first class of results assumes that the generating distribution is DAG-perfect with 
respect to some DAG G with at most k parents for each node. (That is, P and G satisfy precisely 
the same independence assertions.) In this case, algorithms based on various independence tests 
(Spirtes et al., 2000; Cheng et al., 2002) can identify the correct network structure in the infinite 
sample limit (i.e., when given an infinite number of training examples), using a polynomial number 
of independence tests. The infinite sample limit setting is critical in their analysis since it allows for 
exact independence tests. Neither Spirtes et al. (2000) nor Cheng et al. (2002) provide guarantees 
for the case of a finite number of training examples, but future research might extend their results to 
this setting. Chickering and Meek (2002) relax the assumption that the distribution be DAG-perfect; 
they show that, under a certain assumption, a simple greedy algorithm will, in the infinite sample 
limit, identify a network structure which is a minimal I-map of the distribution. They provide 
no polynomial time guarantees, but future work might provide such guarantees for models with 
bounded connectedness (such as the ones our algorithm considers). 


The second class of results relates to the problem of finding a network structure whose score is 
high, for a given set of training examples and some appropriate scoring function. Although finding 
the highest-scoring tree-structured network can be done in polynomial time (Chow and Liu, 1968), 
Chickering (1996) shows that the problem of finding the highest scoring Bayesian network where 
each variable has at most k parents is NP-hard, for any k ≥ 2. (See Chickering et al., 2003, for 
details.) Even finding the maximum likelihood structure among the class of polytrees (Dasgupta, 
1999) or paths (Meek, 2001) is NP-hard. These results do not address the question of the number 
of training examples for which the highest scoring network is guaranteed to be close to the true 
generating distribution. 


Höffgen (1993) analyzes the problem of PAC-learning the structure of Bayesian networks with 
bounded fan-in, showing that the sample complexity depends only logarithmically on the number of 
variables in the network (when considering KL-divergence normalized by the number of variables 
in the network). Höffgen does not provide an efficient learning algorithm (and to date, no efficient 
learning algorithm is known), stating only that if the optimal network for a given data set can be 
found (e.g., by exhaustive enumeration), it will be close to optimal with high probability. 
In contrast, we provide a polynomial-time learning algorithm with similar performance guarantees 
for Bayesian networks with bounded fan-in and bounded fan-out. However, we note that our 
algorithm does not construct a Bayesian network representation, but rather a factor graph; this factor 
graph may not be compactly representable as a Bayesian network, but it is guaranteed to encode a 
distribution which is close to the generating distribution, with high probability. 


6. Discussion 


We have presented the first polynomial-time and polynomial sample-complexity algorithms for both 
parameter estimation and structure learning in factor graphs of bounded degree. When the generat- 
ing distribution is within this class of networks, our algorithms are guaranteed to return a distribution 
close to it, using a polynomial number of training examples. When the generating distribution is 
not in this class, the error of our learned model degrades gracefully. Thus our algorithms and analysis are the first to 
establish the efficient learnability of an important class of distributions. 

While of significant theoretical interest, our algorithms, as described, are probably impractical. 
From a statistical perspective, our algorithm is based on the canonical parameterization, which is 
evaluated relative to a canonical assignment $\bar{x}$. Many of the empirical estimates that we compute 
in the algorithm use only a subset of the training examples that are (in some ways) consistent with 
$\bar{x}$. As a consequence, we make very inefficient use of data, in that many training examples may 
never be used. In regimes where data is not abundant, this limitation may be quite significant in 
practice. From a computational perspective, our algorithm uses exhaustive enumeration over all 
possible factors up to some size k, and over all possible Markov blankets up to size b. When we fix 
k and b to be constant, the complexity is polynomial. But in practice, the set of all subsets of size k 
or b is often much too large to search exhaustively. 

Nevertheless, aside from proving the efficient learnability of an important class of probability 
distributions, the algorithms we propose might provide insight into the development of new learning 
algorithms that do work well in practice. In particular, we might be able to address the statistical 
limitation by putting together canonical factor estimates from multiple canonical assignments $\bar{x}$. We 
might be able to address the computational limitation using a more clever (perhaps heuristic) algo- 
rithm for searching over subsets. Given the limitations of existing parameter and structure learning 
algorithms for undirected models, we believe that the techniques suggested by our theoretical anal- 
ysis are well worth exploring. 
Appendix A. Proofs for Section 2.2 


In this section we give formal proofs of all theorems, propositions and lemmas appearing in Sec- 
tion 2.2. 


A.1 Proof of Theorem 3 


Proof [Theorem 3] The proof consists of two parts: 

1. If we let $\{C_j^*\}_{j=1}^{J^*} = 2^{X} - \emptyset$, then $P(x) = P(\bar{x}) \prod_{j=1}^{J^*} f^*_{C_j^*}(c_j^*)$. 

2. If $P$ is a positive Gibbs distribution with factor scopes $\{C_j\}_{j=1}^{J}$, then the canonical factors $f^*_D$ are trivial all-ones factors whenever $D \notin \cup_{j=1}^{J} 2^{C_j}$. 

The first part states that the canonical parameterization gives the correct distribution assuming we 
use a canonical factor for each subset of variables. It is easily verified by counting how often the 
probabilities $P(\sigma_U[d])$ contribute for each $U \subseteq D \subseteq X$, and is a standard part of most 
Hammersley-Clifford theorem proofs. The second part states that we can ignore canonical factors over subsets of 
variables that do not appear together in one of the factor scopes $\{C_j\}_{j=1}^{J}$. We now prove the second 
part. We have 

$$\log f^*_D(d) = \sum_{U \subseteq D} (-1)^{|D-U|} \log P(\sigma_U[d])$$

$$= \sum_{U \subseteq D} (-1)^{|D-U|} \Big( \sum_{j=1}^{J} \log f_{C_j}(C_j[\sigma_U[d]]) + \log \frac{1}{Z} \Big)$$

$$= \sum_{j=1}^{J} \sum_{U \subseteq D} (-1)^{|D-U|} \log f_{C_j}(C_j[\sigma_U[d]]). \qquad (20)$$

To obtain the last equality, we used the fact that there is an equal number of terms $(\log \frac{1}{Z})$ and 
$(-\log \frac{1}{Z})$. Now consider the contribution of one factor $f_{C_j}$ in the above expression. By assumption 
we have that $D \not\subseteq C_j$ and thus $D - C_j \neq \emptyset$. Now let $Y$ be any element of $D - C_j$. Then we have 
that 

$$\sum_{U \subseteq D} (-1)^{|D-U|} \log f_{C_j}(C_j[\sigma_U[d]]) = \sum_{U \subseteq D-Y} (-1)^{|D-U|} \log f_{C_j}(C_j[\sigma_U[d]]) + \sum_{U \subseteq D-Y} (-1)^{|D-U-Y|} \log f_{C_j}(C_j[\sigma_{U \cup Y}[d]]).$$

Now since $Y \notin C_j$, we have $C_j[\sigma_U[d]] = C_j[\sigma_{U \cup Y}[d]]$, while the two sums carry opposite signs. And thus we get 

$$\sum_{U \subseteq D} (-1)^{|D-U|} \log f_{C_j}(C_j[\sigma_U[d]]) = 0. \qquad (21)$$

And thus combining Eqn. (21) with Eqn. (20) establishes the second part of the proof. □ 
Appendix B. Proofs for Section 3.2 


In this section we give formal proofs of all theorems, propositions and lemmas appearing in Sec- 
tion 3.2. 

Proof [Proposition 4] In Eqn. (2) the number of terms with a positive sign and the number with a negative sign are 
both equal to $2^{|D|-1}$. So we can divide the argument of the log in each term by the same constant 
$P(X - D = (X - D)[\bar{x}])$ without changing the factor. The resulting expression is exactly the expression 
defining $f^*_{D|X-D}$ in Eqn. (7), thus proving the first equality in Eqn. (8). The second equality in 
Eqn. (8) follows directly from Eqn. (1) and the definition of the factors as functions of probabilities 
in Eqn. (7). Eqn. (9) and (10) follow directly from Eqn. (8) and Theorem 3. □ 


Appendix C. Proofs for Section 3.3 


In this section we give formal proofs of all theorems, propositions and lemmas appearing in Sec- 
tion 3.3. 
Proof [Theorem 5] The algorithm consists of two parts: 

• Collecting the empirical probabilities for each of the factors, jointly with the default instantiation 
of their Markov blanket. This can be done in three steps. [Below, recall that the maximum 
factor scope size is $k$, so there are at most $2^k J$ different canonical factors. Each variable can 
take on at most $v$ different values.] 

  – For all instantiations of all factors initialize the occurrence count to zero. This can be 
done in $O(2^k J v^k)$. 

  – When going through the $m$ data points, we need to add to the counts of the observed 
instantiation whenever the Markov blanket is in the default instantiation. Reading a 
specific instantiation of a specific factor and its Markov blanket takes $O(k + b)$ to read 
every variable. Thus collecting the data counts from which the empirical probabilities 
will be computed takes $O(m 2^k J (k + b))$. 

  – Renormalizing all of the entries to get the empirical conditional probabilities takes time 
$O(2^k J v^k)$. 

• Computing the factor entries from the empirical probabilities. To compute one factor entry 
$\hat{f}^*_{C_j^*|\mathrm{MB}(C_j^*)}(c_j^*)$, we have to add (and subtract) $2^{|C_j^*|}$ empirical log-probabilities. (Note this is the case 
independent of the cardinality of the variables in the factor, as seen from Eqn. (7).) This gives 
us $O(2^{|C_j^*|})$ operations per factor entry, and thus $O(J 2^{2k} v^k)$ total for computing the canonical 
factor entries from the empirical probabilities. 

Adding up the upper bounds on the running times of each step proves the theorem. □ 
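
The counting step analyzed in the proof can be sketched as follows. This is an illustrative sketch only: the function name `empirical_conditionals`, the dictionary-based data layout, and the handling of the no-match case are assumptions, not part of the original algorithm description.

```python
from collections import Counter

def empirical_conditionals(data, scope, blanket, default):
    """Estimate P(scope = c | blanket = default instantiation) by counting.

    data:    list of dicts mapping variable -> value (hypothetical format).
    scope:   list of the factor's variables (at most k of them).
    blanket: list of the Markov blanket variables (at most b of them).
    default: dict giving the default value of every variable.
    """
    counts = Counter()
    for x in data:                                        # m data points
        # Only samples whose blanket is in the default instantiation count.
        if all(x[v] == default[v] for v in blanket):      # O(b) reads
            counts[tuple(x[v] for v in scope)] += 1       # O(k) reads
    total = sum(counts.values())
    if total == 0:
        return {}                                         # no usable samples
    return {c: n / total for c, n in counts.items()}      # renormalize
```

Each sample costs O(k + b) per factor, which matches the middle term of the running-time bound. The sketch also makes visible the data inefficiency noted in the discussion: samples whose blanket is not in the default instantiation are discarded.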


Appendix D. Proofs for Section 3.4 


In this section we give formal proofs of all theorems, propositions and lemmas appearing in Sec- 
tion 3.4. 
D.1 Proof of Theorem 6 


The proof of the theorem is based on a series of lemmas. 
The following lemma shows that the log of the empirical average is an accurate estimate of the 
log of the population average, if the population average is bounded away from zero. 


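Lemma 17 Let any $\epsilon > 0, \delta > 0, \lambda \in (0,1)$ be given. Let $\{X_i\}_{i=1}^m$ be i.i.d. Bernoulli($\phi$) random 
variables, where $\lambda \le \phi \le 1 - \lambda$. Let $\hat{\phi} = \frac{1}{m} \sum_{i=1}^m X_i$. Then for 

$$|\log \hat{\phi} - \log \phi| \le \epsilon$$

to hold w.p. $1 - \delta$, it suffices that 

$$m \ge \frac{(1+\epsilon)^2}{2 \lambda^2 \epsilon^2} \log \frac{2}{\delta}.$$

Proof From the Hoeffding inequality we have that for 

$$|\hat{\phi} - \phi| \le \epsilon'$$

to hold w.p. $1 - \delta$ it suffices that 

$$m \ge \frac{1}{2 \epsilon'^2} \log \frac{2}{\delta}. \qquad (22)$$

Since the function $f(x) = \log x$ is Lipschitz with Lipschitz-constant smaller than $\frac{1}{\lambda - \epsilon'}$ over the interval 
$[\lambda - \epsilon', 1]$, we have that for 

$$|\log \hat{\phi} - \log \phi| \le \frac{\epsilon'}{\lambda - \epsilon'}$$

to hold w.p. $1 - \delta$, it suffices that $m$ satisfies Eqn. (22). Now for $\frac{\epsilon'}{\lambda - \epsilon'} \le \epsilon$ to hold, it suffices that 
$\epsilon' \le \frac{\epsilon \lambda}{1 + \epsilon}$. Using this choice of $\epsilon'$ in Eqn. (22) gives the condition for $m$ as stated in the lemma. □ 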
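
To make the bound concrete, the following sketch evaluates the sufficient sample size of Lemma 17 by following the two steps of the proof (choose the additive accuracy, then apply the Hoeffding condition). The function name `lemma17_sample_size` is hypothetical, and natural logarithms are assumed.

```python
import math

def lemma17_sample_size(eps, delta, lam):
    """Sufficient m for |log(phi_hat) - log(phi)| <= eps w.p. 1 - delta,
    assuming lam <= phi <= 1 - lam (natural logarithms)."""
    # Additive accuracy eps' chosen so that eps' / (lam - eps') == eps.
    eps_prime = eps * lam / (1.0 + eps)
    # Hoeffding condition: m >= log(2 / delta) / (2 * eps'^2).
    m = math.log(2.0 / delta) / (2.0 * eps_prime ** 2)
    return math.ceil(m)
```

The result grows as $1/\lambda^2$, which is why probabilities bounded away from zero are essential for the sample complexity results that follow.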





The following lemma shows that for distributions that are bounded away from zero, conditional 
probabilities can be accurately estimated from a small number of samples. 


Lemma 18 Let any $\epsilon, \delta > 0$ be given. Let $\{x^{(i)}, y^{(i)}\}_{i=1}^m$ be i.i.d. samples from a distribution $P$ over 
$X, Y$. Let $\hat{P}$ be the empirical distribution. Let $\lambda = \min_{x,y} P(X = x, Y = y)$. Then for 

$$|\log \hat{P}(X = x | Y = y) - \log P(X = x | Y = y)| \le \epsilon$$

to hold for all $x, y$ with probability $1 - \delta$, it suffices that 

$$m \ge \frac{(1 + \epsilon/2)^2}{2 \lambda^2 (\epsilon/2)^2} \log \frac{4 |val(X)| |val(Y)|}{\delta}.$$

Proof We have (using the definition of conditional probability and the triangle inequality) 

$$|\log \hat{P}(X = x | Y = y) - \log P(X = x | Y = y)|$$

$$= |(\log \hat{P}(X = x, Y = y) - \log \hat{P}(Y = y)) - (\log P(X = x, Y = y) - \log P(Y = y))|$$

$$\le |\log \hat{P}(X = x, Y = y) - \log P(X = x, Y = y)| + |\log \hat{P}(Y = y) - \log P(Y = y)|.$$

Now using Lemma 17 (note that $\lambda = \min_{x,y} P(X = x, Y = y) \le \min_y P(Y = y)$) and the Union bound 
to bound both terms by $\epsilon/2$, we get that for 

$$|\log \hat{P}(X = x | Y = y) - \log P(X = x | Y = y)| \le \epsilon \qquad (23)$$

to hold with probability $1 - 2\delta'$, it is sufficient that 

$$m \ge \frac{(1 + \epsilon/2)^2}{2 \lambda^2 (\epsilon/2)^2} \log \frac{2}{\delta'}. \qquad (24)$$

Using the Union bound, we get that for Eqn. (23) to hold with probability $1 - 2 |val(X)| |val(Y)| \delta'$ 
for all $x \in val(X), y \in val(Y)$ it suffices that $m$ satisfies Eqn. (24). Choosing $\delta = 2 |val(X)| |val(Y)| \delta'$ 
gives the statement of the lemma. □ 


Our algorithm uses probability estimates to compute canonical factors. The following lemma 
shows that accurate probabilities are sufficient to obtain accurate canonical factors. 


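Lemma 19 Let any $\epsilon > 0$ be given. Let any $D, Y, W \subseteq X$, $D \cap Y = \emptyset$, $D \cap W = \emptyset$ be given. Then for 
all $d \in val(D)$ for 

$$|\log \hat{f}^*_{D|Y}(d) - \log f^*_{D|W}(d)| \le \epsilon$$

to hold, it suffices that for all instantiations $d \in val(D)$ we have that 

$$|\log \hat{P}(d | \bar{y}) - \log P(d | \bar{w})| \le \frac{\epsilon}{2^{|D|}}. \qquad (25)$$

Proof 

$$|\log \hat{f}^*_{D|Y}(d) - \log f^*_{D|W}(d)| = \Big| \sum_{Z \subseteq D} (-1)^{|D-Z|} \log \hat{P}(\sigma_{Z:D}[d] | Y = \bar{y}) - \sum_{Z \subseteq D} (-1)^{|D-Z|} \log P(\sigma_{Z:D}[d] | W = \bar{w}) \Big|$$

$$\le \sum_{Z \subseteq D} |\log \hat{P}(\sigma_{Z:D}[d] | Y = \bar{y}) - \log P(\sigma_{Z:D}[d] | W = \bar{w})|$$

$$\le \sum_{Z \subseteq D} \frac{\epsilon}{2^{|D|}}$$

$$= \epsilon,$$

where, in order, we used the definitions of $\hat{f}^*$ and $f^*$; triangle inequality; Eqn. (25); number of 
subsets of $D$ equals $2^{|D|}$. □ 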
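
The alternating sum over subsets used in the proof can be illustrated with a short sketch. The function name `canonical_log_factor` and the callback `log_cond`, which stands in for $\log P(\cdot \mid Y = \bar{y})$, are assumptions introduced for this example.

```python
import math
from itertools import combinations

def canonical_log_factor(log_cond, d, default):
    """log f*_{D|Y}(d) as the alternating sum over subsets Z of D.

    log_cond: maps a full assignment to D (a dict) to log P(D = . | Y = y_bar).
    d:        dict, the instantiation of D at which the factor is evaluated.
    default:  dict, the default value of each variable in D.
    """
    variables = list(d)
    total = 0.0
    for r in range(len(variables) + 1):
        for Z in combinations(variables, r):
            # sigma_{Z:D}[d]: variables in Z keep d's values, the rest use defaults.
            sigma = {v: (d[v] if v in Z else default[v]) for v in variables}
            sign = (-1) ** (len(variables) - r)           # (-1)^{|D - Z|}
            total += sign * log_cond(sigma)
    return total
```

Because exactly $2^{|D|}$ log-probabilities enter the sum, an error of at most $\epsilon / 2^{|D|}$ in each one yields an error of at most $\epsilon$ in the log-factor, which is the content of the lemma.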


The next step is to show that, if we obtain good estimates of the factors, the distributions they 
induce should be close as well. The following lemma shows that distributions with approximately 
the same factors are close to each other, by proving a bound on $D(P\|\hat{P}) + D(\hat{P}\|P)$, and thus (since 
$D(\cdot\|\cdot) \ge 0$) a bound on $D(P\|\hat{P})$. 

Lemma 20 Let $P(x) = \frac{1}{Z} \prod_{j=1}^{J} f_j(c_j)$ and $\hat{P}(x) = \frac{1}{\hat{Z}} \prod_{j=1}^{J} \hat{f}_j(c_j)$. Let 
$\epsilon = \max_{j \in \{1,\dots,J\}, c_j} |\log f_j(c_j) - \log \hat{f}_j(c_j)|$. Then we have that 

$$D(P\|\hat{P}) + D(\hat{P}\|P) \le 2 J \epsilon.$$

Proof 

$$D(P\|\hat{P}) + D(\hat{P}\|P) = E_{X \sim P}(\log P(X) - \log \hat{P}(X)) + E_{X \sim \hat{P}}(\log \hat{P}(X) - \log P(X))$$

$$= E_{X \sim P} \sum_{j=1}^{J} (\log f_j(C_j) - \log \hat{f}_j(C_j)) - \log \frac{Z}{\hat{Z}} + E_{X \sim \hat{P}} \sum_{j=1}^{J} (\log \hat{f}_j(C_j) - \log f_j(C_j)) - \log \frac{\hat{Z}}{Z}$$

$$\le 2 J \epsilon,$$

where we used in order: the definition of KL-divergence; the definition of $P$, $\hat{P}$; $\log \frac{Z}{\hat{Z}} + \log \frac{\hat{Z}}{Z} = 0$, 
and the fact that each term in the expectation is bounded in absolute value by $\epsilon$. □ 

Note that (by using the sum of the KL-divergences) we have that the terms that involve the 
partition functions $Z$ and $\hat{Z}$ cancel. This enables us to prove an error bound without bounding the 
difference $|\log Z - \log \hat{Z}|$ as a function of the errors in the factors. 
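
As a sanity check, the bound of Lemma 20 can be verified numerically on a tiny model by brute-force enumeration. The sketch below is illustrative only: the helper names (`gibbs`, `kl`, `perturb`, `check_lemma20`) and the three-variable binary example are assumptions, and enumeration is feasible only for very small models.

```python
import itertools
import math

def gibbs(factors, n_vars):
    """Normalized distribution over binary vectors from (scope, table) factors."""
    states = list(itertools.product([0, 1], repeat=n_vars))
    weights = [math.prod(t[tuple(x[i] for i in scope)] for scope, t in factors)
               for x in states]
    Z = sum(weights)
    return {x: w / Z for x, w in zip(states, weights)}

def kl(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def perturb(table, eps):
    """Multiply each entry by exp(+eps) or exp(-eps), so log-errors equal eps."""
    return {k: val * math.exp(eps if sum(k) % 2 == 0 else -eps)
            for k, val in table.items()}

def check_lemma20(eps=0.05):
    f1 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
    f2 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.5}
    true_factors = [((0, 1), f1), ((1, 2), f2)]
    est_factors = [(s, perturb(t, eps)) for s, t in true_factors]
    p, p_hat = gibbs(true_factors, 3), gibbs(est_factors, 3)
    symmetric_kl = kl(p, p_hat) + kl(p_hat, p)
    return symmetric_kl, 2 * len(true_factors) * eps      # (value, 2*J*eps)
```

The partition functions are never compared directly, in line with the remark above that they cancel in the symmetric sum.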

We now show how the previous lemmas can be used to prove the parameter learning sample 
complexity result stated in Theorem 6. 

Proof [Theorem 6] First note that since the scopes of the canonical factors used by the algorithm 
are subsets of the given scopes $\{C_j\}_{j=1}^{J}$, we have that 

$$\max_j |C_j^* \cup \mathrm{MB}(C_j^*)| \le b + k.$$

Let $\hat{P}$ be the empirical distribution as given by the samples $\{x^{(i)}\}_{i=1}^m$. Let $M_j^* = \mathrm{MB}(C_j^*)$. Then 
from Lemma 18 we have that for any $j \in \{1,\dots,J^*\}$ for 

$$|\log \hat{P}(C_j^* = c_j^* | M_j^* = m_j^*) - \log P(C_j^* = c_j^* | M_j^* = m_j^*)| \le \epsilon' \qquad (26)$$

to hold for all instantiations $c_j^*, m_j^*$ with probability $1 - \delta'$, it suffices that 

$$m \ge \frac{(1 + \epsilon'/2)^2}{2 \gamma^{2k+2b} (\epsilon'/2)^2} \log \frac{4 v^{k+b}}{\delta'}. \qquad (27)$$

Using Lemma 19 we obtain that for all instantiations $c_j^*$ we have that Eqn. (26) implies 

$$|\log \hat{f}^*_{C_j^*|\mathrm{MB}(C_j^*)}(c_j^*) - \log f^*_{C_j^*|\mathrm{MB}(C_j^*)}(c_j^*)| \le 2^k \epsilon'. \qquad (28)$$

Using the union bound, we get that for Eqn. (28) to hold for all $j \in \{1,\dots,J^*\}$ with probability $1 - J^* \delta'$, it 
suffices that $m$ satisfies Eqn. (27). When Eqn. (28) holds for all $j \in \{1,\dots,J^*\}$, Lemma 20 and Proposition 4 
give us that 

$$D(P\|\tilde{P}) + D(\tilde{P}\|P) \le 2 J^* 2^k \epsilon'. \qquad (29)$$

We have that $J^* \le 2^k J$. Choosing $\epsilon' = \frac{\epsilon}{2^{2k+1}}$ and $\delta' = \frac{\delta}{2^k J}$ and substituting these choices into Eqn. (27) 
and Eqn. (29) gives the theorem. □ 
D.2 Proof of Theorem 7 


Proof [Theorem 7] From Proposition 4 we have that 


` 


1 
Ox)=5 Imo; (45). 


j=l 


We can rewrite this product as follows: 


1 x x 
Q(x) = 7 IT, fome) (d}) 
DRC {CE kal 
i MB(D3) = MB(D*) 





II, fome) (d) 

_ DE {Ci} 

Ë MB(D’) Z MBD 7 
II JOs MB (D) (dj). (30) 

JD HCH 

We also have 
oe 
We can rewrite this product as follows: 
D 1 px * 
P(x) = z IT, Jõmeo)(d}) 
Dee {Cry 
= MB(Ds) = “MBs ) 
Jomo) 

D} € (Gh 

. MB(D*) + MB(Ds) 

i II fomm (7) 
EDL 
MB(C3) = MB(C*) 
II fome (C7): (32) 
CED 
MB(C%) Æ MB(C*) 
We have (adding and subtracting same term): 
tog DMB) E) g Pime 4) Jomo) (33) 
O83 x FE * d* ! x *\ 7 
T p * Foi p) (4 i) Jomo) 
We also have for $j : C_j^* \notin \{D_k^*\}_k$ that $\log f^*_{C_j^*|\mathrm{MB}(C_j^*)}(c_j^*) = 0$. Thus we have (adding zero and adding 
and subtracting the same term): 

log j* (c}) =1 fem CV) de Jemc ©) an 
og Re Tire c = og ae) x n: 
Using Eqn. (30), Eqn. (32), Eqn. (33) and Eqn. (34) we get that D(Q||P) + D(P||Q) = 
A (dj) fp (d;) 
Ex~o( £ m D*|MB(D oe )- Expl X lo a) (35) 
Dp, Dimon)” pe crea, “Tmo ©) 
s MB(D*) = TBO’) k MB(D') = TBO;) 
Bo È frmo È) 
+Exo( X Jog PMB) 2 D*|MB(D*) : j= Expl a Jog PMB) l D?|MB(D* - ) (36) 
«et — (E) 
_ D} € {C} }{ı “DiIMBD;) © / _ D} € {Ce} Ds|MB(D; j 
si Mn Hi a Mon HA 
„(d * dt) 
; Dic {CE} D;|MB(D;) > / f D’ è (C D*|MB(D}) ` 7 
7 MB(D*)  MB(D5) j MB(D') 2 MB(D;) 
+Ex~o( L logfmem) (a;)) -Ex p( L les fo. mews) (dj) (38) 
IDIACOL, IDC, 
-Ex~o( E log fé; me(c;)(€})) + Ex~e( L log féx mpc) (€})) (39) 
j C} £ {Dear , Cpe Di \ 
` MB(C*) = MB(C*)  MB(C¥) = MB(C*) 
ae TE 
Seal n ig ia ear ) - Ex. £ oe MB(C! =) (40) 
Ce Dik kar OE) j -C g {Da C;IMB(C;)` 7 
” MB(C*) 4 MB(C*) ” MB(C%) # MB(C%) 
Sex n(c fë c) 
seal E mE nad y o) 
P C} £ (Dih C;|MB(C;) ` J r C ¢ Dikh- Ci|MB(C}) ` J 
MB(C*) # MB(C5) MB(C*) # MB(C*) 
Ž Z 
+log z + log F (42) 


Recall T = {j : C} ¢ {Diy p MB(C*) Æ MB(C*)}. Using the same reasoning as in the proof 
of Theorem 6, we have that for the sum of the terms in lines (35), (36), (39) and (40) to be bounded 
by Je with probability at least 1 — 4, it suffices that m satisfies the condition on m in Eqn. (11) of 


Theorem 6. 
The sum of the terms in lines (37) and (41) can be bounded by 


) 


Jeti c (c) 
2 Ł maxe | log serie 
j 


j : MB(C3)AMB(C*) fome) (cj) 
The sum of the terms in line (38) can be bounded by 


2 Ł maxa; log fp; (d,)|. 
ED ELC. a1 





The two terms in line (42) sum to zero. 
This establishes the theorem. □ 


Appendix E. Proofs for Section 3.5 


We will treat the proofs for the factor graph case and the Bayesian network case in two separate 
sections. 


E.1 Proof of Theorem 8 


Proof [Theorem 8] Using the same reasoning as in the proof of Theorem 6, we get that for any fixed 
jE{1,--- ,J*} for 


g’ 








og Fes me(c;) (7) — log Fe-imaces)(€7)! S zer (43) 
to hold for all instantiations c} with probability 1 — &’ it suffices that 
1+ -e 2 Ayk+b 
"ON A. log ——. (44) 
2y2k+2b( wit)? 5 





Also, using the same reasoning as in the proof of Lemma 20, we get that 
> s pee A 
D, (P||P) +D, (P||P) < R >»: maxe;|log fox \mB(c;) (¢}) = log fe: mpce;)(¢)) 
jel 


We have for all factors and instantiations that (recall that 2" probabilities contribute to each factor, 
and each probability is over k variables, thus each (log) conditional probability has maximal skew 


log doe < klog D 


A 1 
* * ok * k 
[log fé me(cs) (Cj) — 108 féme(c) (€7)| < K2“ log T 


Note that clipping of the probability estimates ensures this holds with probability one. Thus we get 
that 





5 “a 2, £ pas cas ml 
E(D,(P\|P)+D,(P||P)) < oT seat +6 = J k2 Pe 
J J 1 
< -e + =k2%+t18' log —, 
n n y 
where for the last inequality we used $J^* \le 2^k J$. The Markov inequality ($P(X < a) \ge 1 - \frac{E X}{a}$) gives 
us that 
D,(P||P) + D, (P||P) < € (45) 










holds with probability 
Jol 1 Jpg2kt+1slyog 1 
zE + k2*°"'8'log 7 
€ 
f k2% Slog 4 ae : : ‘ 
Now choosing ¢’, 8’ such that è = Z e = 7 “FY and substituting this back into the sufficient 


condition on m, gives us that for Eqn. (45) to hold with probability 1 — 4, it suffices that 











(1 + nye k22kt4yk+b log 
2 0g 
Ce eC 


Since the number of factors per variable is bounded by a constant, we have that $J/n$ is bounded 
by that constant. And thus we have that $m$ is $O(1)$ when considering only the dependence on $n$, the 
number of variables. □ 


E.2 Proof of Theorem 9 


For clarity of the overall proof structure of Theorem 9, we defer the proofs of the helper lemmas 
to the next section. Note the theorem stated in this section is stronger than Theorem 9: it includes 
dependencies of $m$ on $\epsilon$, $\delta$, the maximum domain size of the variables $v$, and the maximum number 
of parents $k$. It also shows the graceful degradation for the case of learning a distribution that does 
not factor according to the given structure. 

For any $\gamma \le \frac{1}{v}$, and any multinomial distribution with means $\theta_{1:v}$, the multinomial distribution 
with means clipped to $[\gamma, 1 - \gamma]$ refers to the distribution obtained by clipping every $\theta_i$ to $[\gamma, 1 - \gamma]$, 
after which the $\theta_i$ are adjusted to sum to one, while kept in the interval $[\gamma, 1 - \gamma]$. It is easily verified 
this is always possible without changing any $\theta_i$ by more than $v\gamma$. (Although the adjustment such that 
the entries sum to one need not be unique, it does not matter for our results which choice is made.) 
We write $D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v})$ as a shortcut for $D(P_1 \| P_2)$, where $P_1, P_2$ are multinomial distributions with 
means $\theta^{(1)}_1, \dots, \theta^{(1)}_v$ and $\theta^{(2)}_1, \dots, \theta^{(2)}_v$ respectively. The following lemmas establish the basic results 
used to prove our main sample complexity bounds for Bayesian networks parameter learning. 
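
One possible way to carry out the clipping and re-adjustment described above is sketched below. Since the text notes that the adjustment need not be unique, the proportional redistribution used here is just one admissible choice; the function name `clip_multinomial` and the requirement of at least two outcomes are assumptions of this sketch.

```python
def clip_multinomial(theta, gamma):
    """Clip each mean to [gamma, 1 - gamma], then restore the sum to one
    while staying inside the interval. Assumes len(theta) >= 2 and
    0 < gamma <= 1 / len(theta)."""
    v = len(theta)
    if v < 2 or not 0.0 < gamma <= 1.0 / v:
        raise ValueError("need v >= 2 and 0 < gamma <= 1/v")
    clipped = [min(max(t, gamma), 1.0 - gamma) for t in theta]
    excess = sum(clipped) - 1.0
    if excess > 0:
        # Remove mass in proportion to each entry's distance above gamma.
        slack = [c - gamma for c in clipped]
        total = sum(slack)
        clipped = [c - s * excess / total for c, s in zip(clipped, slack)]
    elif excess < 0:
        # Add mass in proportion to each entry's distance below 1 - gamma.
        room = [(1.0 - gamma) - c for c in clipped]
        total = sum(room)
        clipped = [c - r * excess / total for c, r in zip(clipped, room)]
    return clipped
```

For instance, with means (0, 0.02, 0.98) and $\gamma = 0.05$, the two small entries are raised to 0.05 and the surplus is taken from the large entry, which has the only available slack.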


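Lemma 21 Let any $\delta > 0, \epsilon > 0$ be fixed, and let there be $m$ i.i.d. samples drawn from a $v$-valued 
multinomial distribution with means $\theta^*_{1:v}$, and let $\tilde{\theta}_{1:v}$ be the empirical distribution, clipped to the 
interval $[\frac{\epsilon}{4v^3}, 1 - \frac{\epsilon}{4v^3}]$. Then if $m \ge \frac{8v^4}{\epsilon^2} \log \frac{2v}{\delta}$, we have that $D(\theta^*_{1:v} \| \tilde{\theta}_{1:v}) \le \epsilon$ w.p. $1 - \delta$. 

Lemma 22 Let two $v$-valued multinomial distributions with means $\theta^{(1)} \in [0,1]^v$, $\theta^{(2)} \in [\gamma, 1 - \gamma]^v$ be 
given. Then we have that $D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v}) \le \log \frac{1}{\gamma}$. 

Lemma 23 Let $m_1$ be the sum of $m$ i.i.d. Bernoulli($p$) random variables. If $m \ge \frac{8}{p} \log \frac{1}{\delta}$, then we 
have that $m_1 \ge \frac{mp}{2}$ with probability $1 - \delta$. 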


Lemma 24 Let {X)}‘] be a set of k+1 random variables with |val(X;)| < v for alli=1:k+1. 
Let u € val(X1:x). Let any £ > 0,8 > 0 be given. Let P(X44|X1.~ = u) be the empirical estimate 
of Xk+1|X1:x = u (based on m independent samples of {X; ee drawn from P(X .¢+1)) clipped to the 
interval Ee „1 ixar! Then to ensure that D(P(Xx41|X1:k = U) ||Ē(Xk1|X1:x = u) ) < 


w.p. 1 — ò, it suffices that m > 16. Es 18+ log? + de log Ẹ 








vi/? Ba 1:K=U) 
Lemma 25 Let $\{X_i\}_{i=1}^{k+1}$ be a set of $k+1$ random variables with $|val(X_i)| \le v$ for all $i = 1 : k+1$. 
Let any £ > 0,5 > 0 be given. For all u € val(Xı:x) let P(X,41|X1-4 =u) be the empirical esti- 
mate of X41|X1.4 = u (based on m independent samples of (xe drawn from P(X\:441)) clipped 


to the interval EE „1 Iaka! Then for Yucval(X,) P(X1:k = U)D(P(Xk1|Xi:x = U) 
4yk+1 


||P (Xeu1|X1-4 = u)) < € to hold with probability 1 — 6, it suffices that m > lowe log? = log "y~ 








Because KL divergence can be unbounded, typically some process, such as clipping, is needed 
to ensure that our algorithms do not suffer infinite loss. Lemmas 21 and 22 show that we can bound 
the KL divergence by clipping the (estimated) probabilities away from {0,1}. (Abe et al. (1991) 
and Abe et al. (1992) give a more detailed treatment of uniform convergence for KL divergence 
loss.) Lemma 24 shows how to bound our error on individual conditional probability table (CPT) 
entries. Note that in Lemma 24, the loss is allowed to be larger for less likely instantiations of the 
conditioning variables. Also note that Lemma 24 shows that the number of samples m required 
does not depend on the probability of the instantiations of the conditioning variables, no matter 
how likely/unlikely. Lemma 23 is used in our proof of Lemma 24 to relate the required number of 
samples with a specific instantiation u of the conditioning variables to the actual number of training 
examples required. Lemma 25 relates the loss on individual CPT entries to the conditional KL 
divergence, and follows directly from Lemma 24 and Cauchy-Schwarz. 

Using the lemmas above, we are now ready to prove a bound on the sample complexity of learn- 
ing a fixed structure BN. We note that Dasgupta (1997) showed a bound on the sample complexity 
of BN learning that was polynomial in the number of variables n. His proof method relied on using a 
Union bound to show that all of the n nodes in the BN will have accurate CPT entries, which meant 
the bound necessarily had to have a dependence on n (even if the normalized KL criterion had been 
used). For the normalized KL criterion, his method gives a logarithmic dependence on n. Below, 
we will derive a strictly stronger bound, which has no dependence on the number of variables in 
the BN. Our bound is based on (i) showing that, given any fixed node, with high probability its 
CPT entries will be accurate (Lemma 25), and (ii) using the Markov inequality to show that, as a 
consequence, almost all of the nodes in the network will have CPT entries that are accurate. This 
turns out to be sufficient to ensure the estimated BN parameters will provide a good approximation 
to the joint distribution, and eliminates the bound’s dependence on n. 

In the theorem below, $P$ is some “true” underlying distribution from which the samples are 
drawn; $P_{BN}$ is the best possible approximation to $P$ using a given BN structure (in the sense of 
minimizing $D_n(P\|\cdot)$), and $\tilde{P}_{BN}$ is the learned estimate of $P$. We give a bound on the number of 
training examples required for $\tilde{P}_{BN}$’s performance to approach that of $P_{BN}$. 


Theorem 26 Let any $\epsilon > 0$ and $\delta > 0$ be fixed. Let $P$ be any probability distribution over $n$ multinomial 
random variables $X_{1:n}$, where each of the random variables $X_i$ can take on at most $v$ values. 
Let any BN structure be given, and let $k$ be the maximum number of parents per variable. ($P$ may 
not factor according to the BN structure.) Let $P_{BN}$ be the best possible estimate of $P$ using a model 
that factorizes according to the BN structure. (I.e., $P_{BN}$’s conditional probability distributions satisfy 
$P_{BN}(X_i | Pa_{X_i}) = P(X_i | Pa_{X_i})$.) Let $\tilde{P}_{BN}$ denote the probability distribution obtained by fitting (via 
maximum likelihood) a BN model with the given structure to the $m$ i.i.d. training examples drawn 
from P, and then clipping for each X peek CPT entry to the interval Erta aal TAEA ]. Then, 
to ensure that with probability $1 - \delta$, $\tilde{P}_{BN}$ is nearly as good an estimate as $P_{BN}$ of the true distribution 
$P$, that is, we have 
$$D_n(P\|\tilde{P}_{BN}) \le D_n(P\|P_{BN}) + \epsilon,$$


it suffices that the training set size be 


vitk]og? 8 gy kt i 8v? 
e SE 





Remark. Note that if $P$ does factor according to the given BN structure, then the term $D_n(P\|P_{BN})$ 
above equals zero. 
Proof The following equality is easily verified: 

$$D(P\|\tilde{P}_{BN}) = D(P\|P_{BN}) + \sum_{i=1}^{n} \sum_{u \in val(Pa_{X_i})} P(Pa_{X_i} = u) \, D(P(X_i | Pa_{X_i} = u) \| \tilde{P}(X_i | Pa_{X_i} = u)). \qquad (46)$$
From Lemma 25 we have that for estimates clipped to | Tal wl Twat) 5], that for 


P(PaX ; = u)D(P(X;|PaX; = u)||P(X;|PaX; =u)) < e’ 
ucéval(PaX;) 


to hold with probability 1 — 7, it suffices that 


Atk 42 4v3 
vit log = 
m> z2 = log 





4yk+1 


(47) 


Now let Z = Y;ni be the sum over indicator variables ni = 1{YyP(PaX; = u)D(P(X;|PaX; = 
u)||P(X;|PaX; = u)) > £'}, and let t be as above. Then applying the Markov inequality to the 
non-negative random variable Z gives 


POY his SS l= 6: (48) 
So, we have that 


= 3 
D,(P\|Pav) < Dn(P\|Pew) + +EH -nde + EL nilog E 
D,(P||Pav) +£ + § log wè wp. 1-8. (49) 


e/ 





IA 


For the first inequality we used Eqn. (46), the definition of ù); and Lemma 22. The second inequality 
follows from Eqn. (48). To bound the right hand side of Eqn. (49), we bound each of the terms by 


ei For the first term this implies £' = a for the second term, this allows us to solve for the free 
e _ _ ð 








parameter T = . Substituting these expressions for €' and T into Eqn. (47, 49), gives 


2log m E 2log 33 
the statement of the theorem. a 


Note that Eqn. (46) holds for all distributions $\tilde{P}_{BN}$ that factor according to the BN. Since KL 
divergence is always non-negative, Eqn. (46) implies that $D(P\|P_{BN}) \le D(P\|\tilde{P}_{BN})$. So the clipped 
maximum likelihood learning achieves the minimal KL divergence loss for infinite sample size. 




Also note that in general, $D(P\|\tilde{P}_{BN})$ is not equal to $D(P\|P_{BN}) + D(P_{BN}\|\tilde{P}_{BN})$. In particular, the 
second term in Eqn. (46) is not equal to $D(P_{BN}\|\tilde{P}_{BN})$, since $P_{BN}(Pa_{X_i} = u)$ is (in general) not equal 
to $P(Pa_{X_i} = u)$. (In contrast, for log-linear models/undirected graphical models a decomposition of 
the KL-divergence does hold; see footnote 17.) 


E.3 Proofs of Lemmas 21, 22, 23, 24, 25 
We will first state and prove two lemmas that are used to subsequently prove lemmas 21, 22, 23, 24, 25. 


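Lemma 27 For any $\theta^{(1)}, \theta^{(2)} \in (0,1)^v$, $\sum_{i=1}^{v} \theta^{(1)}_i = 1$, $\sum_{i=1}^{v} \theta^{(2)}_i = 1$, we have 

$$D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v}) \le \sum_{i=1}^{v} \frac{(\theta^{(1)}_i - \theta^{(2)}_i)^2}{\theta^{(2)}_i}.$$

Proof We use the concavity of the log function to upper bound it with a tangent line at $\theta^{(2)}_i$, which 
gives the following inequality: 

$$\log \theta^{(1)}_i \le \log \theta^{(2)}_i + \frac{1}{\theta^{(2)}_i} (\theta^{(1)}_i - \theta^{(2)}_i). \qquad (50)$$

Substituting Eqn. (50) into the definition of $D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v})$ gives us: 

$$D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v}) \le \sum_{i=1}^{v} \frac{\theta^{(1)}_i}{\theta^{(2)}_i} (\theta^{(1)}_i - \theta^{(2)}_i).$$

Adding $0 = -\sum_{i=1}^{v} \frac{\theta^{(2)}_i}{\theta^{(2)}_i} (\theta^{(1)}_i - \theta^{(2)}_i)$ to the right hand side gives 

$$D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v}) \le \sum_{i=1}^{v} \frac{1}{\theta^{(2)}_i} (\theta^{(1)}_i - \theta^{(2)}_i)^2,$$

which proves the lemma. □ 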
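
The inequality of Lemma 27 is easy to check numerically. The sketch below compares the KL divergence (natural logarithm) with the quadratic upper bound on randomly drawn distributions; the helper names and the sampling scheme are assumptions made for illustration.

```python
import math
import random

def kl_multinomial(theta1, theta2):
    """D(theta1 || theta2) with natural logarithms."""
    return sum(a * math.log(a / b) for a, b in zip(theta1, theta2) if a > 0)

def quadratic_bound(theta1, theta2):
    """Right-hand side of Lemma 27: sum_i (theta1_i - theta2_i)^2 / theta2_i."""
    return sum((a - b) ** 2 / b for a, b in zip(theta1, theta2))

def random_simplex_point(v, rng):
    """A strictly positive probability vector of length v."""
    w = [rng.random() + 1e-3 for _ in range(v)]
    s = sum(w)
    return [x / s for x in w]
```

The bound is loose when some entry of the second distribution is close to zero, which is the reason the estimates are clipped away from zero before the lemma is applied.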


Lemma 28 For any $v$-valued multinomial distributions with means $\theta^{(1)} \in [0,1]^v$ and $\theta^{(2)} \in [\gamma, 1 - \gamma]^v$, 
with $\gamma \le \frac{1}{v}$, we have 

$$D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v}) \le \sum_{i=1}^{v} \frac{(\theta^{(1)}_i - \theta^{(2)}_i)^2}{\gamma}.$$

Proof Immediately from Lemma 27, since $\frac{1}{\theta^{(2)}_i} \le \frac{1}{\gamma}$. □ 

17. Let $P$ be any distribution, let $\{P_\theta\}$ be a family of log-linear models parameterized by $\theta$, and let $\theta^* = \arg\min_\theta D(P\|P_\theta)$. 
Then we do have that $D(P\|P_\theta) = D(P\|P_{\theta^*}) + D(P_{\theta^*}\|P_\theta)$. The proof relies on the fact that $\theta^*$ is such that $E_P[\eta_i] = 
E_{P_{\theta^*}}[\eta_i], \forall i$, with $\eta_i$ the natural parameters of the log-linear model (see, for example, Kullback (1959)). Due to the 
local normalization constraints, this is not true in BNs. 

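Proof [Lemma 21] Let $\hat{\theta}_{1:v}$ be the unclipped sample means. The triangle inequality gives for any 
$i \in \{1, \dots, v\}$ that 

$$|\theta^*_i - \tilde{\theta}_i| \le |\theta^*_i - \hat{\theta}_i| + |\hat{\theta}_i - \tilde{\theta}_i|. \qquad (51)$$

From the Hoeffding inequality and the Union bound we have that for all $i \in \{1, \dots, v\}$ for 

$$|\theta^*_i - \hat{\theta}_i| \le \epsilon' \qquad (52)$$

to hold w.p. $1 - \delta$, it suffices that 

$$m \ge \frac{1}{2 \epsilon'^2} \log \frac{2v}{\delta}. \qquad (53)$$

Since $\tilde{\theta}_{1:v}$ are obtained by clipping $\hat{\theta}_{1:v}$ into $[\gamma, 1 - \gamma]$ (for now $\gamma$ is a free parameter, which will soon 
be matched to the clipping choice of $\frac{\epsilon}{4v^3}$ of the lemma), we have that (see introduction of previous 
section) 

$$|\hat{\theta}_i - \tilde{\theta}_i| \le v\gamma. \qquad (54)$$

Using Lemma 28 and then Eqn. (51), (52), and (54) we have that 

$$D(\theta^*_{1:v} \| \tilde{\theta}_{1:v}) \le \sum_{i=1}^{v} \frac{(\theta^*_i - \tilde{\theta}_i)^2}{\gamma} \le \frac{v (\epsilon' + v\gamma)^2}{\gamma} \qquad (55)$$

holds w.p. $1 - \delta$ if $m$ satisfies Eqn. (53). The choice of $\gamma = \frac{\epsilon'}{v}$ minimizes the right hand side of 
Eqn. (55), and gives us that 

$$D(\theta^*_{1:v} \| \tilde{\theta}_{1:v}) \le 4 v^2 \epsilon'. \qquad (56)$$

Now choosing $\epsilon' = \frac{\epsilon}{4v^2}$ (corresponding to $\gamma = \frac{\epsilon}{4v^3}$) gives us that for 

$$D(\theta^*_{1:v} \| \tilde{\theta}_{1:v}) \le \epsilon$$

to hold w.p. $1 - \delta$ it suffices that 

$$m \ge \frac{8 v^4}{\epsilon^2} \log \frac{2v}{\delta},$$

which proves the lemma. □ 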


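Proof [Lemma 22] We have 

$$D(\theta^{(1)}_{1:v} \| \theta^{(2)}_{1:v}) = \sum_{i=1}^{v} \theta^{(1)}_i \log \frac{\theta^{(1)}_i}{\theta^{(2)}_i}$$

$$\le \max_i \log \frac{\theta^{(1)}_i}{\theta^{(2)}_i}$$

$$\le \max_i \log \frac{1}{\theta^{(2)}_i}$$

$$\le \max_{y \in [\gamma, 1 - \gamma]} \log \frac{1}{y}$$

$$= \log \frac{1}{\gamma},$$

which proves the lemma. □ 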


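Proof [Lemma 23] Let $\hat{p} = \frac{m_1}{m}$, then 

$$\Pr(m_1 \le \tfrac{mp}{2}) = \Pr(\hat{p} \le \tfrac{p}{2}) = \Pr(p - \hat{p} \ge \tfrac{p}{2}).$$

Applying the (multiplicative) Chernoff bound gives 

$$\Pr(m_1 \le \tfrac{mp}{2}) \le \exp(-\tfrac{mp}{8}) = \delta,$$

where the last equality defines $\delta$. Solving the last equation for $m$ shows that $m = \frac{8}{p} \log \frac{1}{\delta}$ samples 
are sufficient to guarantee $\Pr(m_1 \ge \frac{mp}{2}) \ge 1 - \delta$, which is the statement of the lemma. □ 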


Proof [Lemma 24] Below let 0; = P(Xp41 = i|X1:4 = u), let 6; = P(X.) = i|X1., = u), and let 
y = |val(Xx+1)|. We split the proof into 2 cases 





Loez > log a? This case is trivial, since by Lemma 22 we have that D(01.;||61.) < 
v Bae u) E 

log = A < log = 4v" and the statement of the lemma is trivially implied, for all 61.5 € EN = 

a [% so m = 0 samples is sufficient. 


2. 7 E < log ar Let my be the number of samples for which X;.. = u. Then (using 
v 1:kK7U 


Lemma 21 and v < v) a number of samples 





8v4 P(X =u 2v 
> digs y (57) 





Mu 


is sufficient to guarantee that D(01-5||61.5) < pease with probability 1 — 8’. To obtain, 
Lik 


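with probability $1 - \delta''$, at least $m_u$ samples from $P(X_{1:k+1})$ for which $X_{1:k} = u$, it suffices that
the total number of samples $m$ from $P(X_{1:k+1})$ satisfies

\[
m \ge \max\left\{ \frac{8}{P(X_{1:k} = u)} \log \frac{1}{\delta''},\; \frac{2 m_u}{P(X_{1:k} = u)} \right\},
\]
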


where we used Lemma 23. Using P(X: = u) > a. (we are in case 2) and (57), and 


setting $\delta' = \delta/2$, $\delta'' = \delta/2$, gives the statement of the lemma. □


Proof [Lemma 25] Using Lemma 24 and the union bound over all instantiations $u$ of $X_{1:k}$ (there are
at most $v^k$ instantiations) we get that for

\[
\forall u \in \mathrm{val}(X_{1:k}): \quad D\big(P(X_{k+1} \mid X_{1:k} = u) \,\|\, \hat{P}(X_{k+1} \mid X_{1:k} = u)\big) \le \frac{\varepsilon}{v^{k/2} \sqrt{P(X_{1:k} = u)}} \tag{58}
\]

to hold with probability $1 - \delta$, it suffices that


Lov**# log? 4° gy 
m> = log. (59) 


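So we have that the following inequalities hold w.p. $1 - \delta$ if $m$ satisfies Eqn. (59):

\[
\begin{aligned}
\sum_{u \in \mathrm{val}(X_{1:k})} P(X_{1:k} = u)\, D\big(P(X_{k+1} \mid X_{1:k} = u) \,\|\, \hat{P}(X_{k+1} \mid X_{1:k} = u)\big)
&\le \sum_{u \in \mathrm{val}(X_{1:k})} P(X_{1:k} = u)\, \frac{\varepsilon}{v^{k/2} \sqrt{P(X_{1:k} = u)}} \\
&= \varepsilon \sum_{u \in \mathrm{val}(X_{1:k})} \frac{\sqrt{P(X_{1:k} = u)}}{v^{k/2}} \\
&\le \varepsilon,
\end{aligned}
\]

where we used in order: Eqn. (58), simplification, Cauchy-Schwarz. (Cauchy-Schwarz over the at most $v^k$
instantiations gives $\sum_u \sqrt{P(X_{1:k} = u)} \le \sqrt{v^k} \sqrt{\sum_u P(X_{1:k} = u)} = v^{k/2}$.) The last inequality together
with the condition in Eqn. (59) proves the lemma. □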
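The Cauchy-Schwarz step is easy to confirm numerically. The sketch below treats a list of probabilities as the distribution over instantiations $u$ and evaluates the weighted sum of per-instantiation budgets $\varepsilon / (\sqrt{N} \sqrt{P(u)})$, where $N$ plays the role of $v^k$; all names are hypothetical and chosen for illustration.

```python
import math


def sqrt_mass_sum(probs):
    """Sum of sqrt(P(u)) over all instantiations u."""
    return sum(math.sqrt(p) for p in probs)


def weighted_kl_budget(probs, eps):
    """Sum_u P(u) * eps / (sqrt(N) * sqrt(P(u))), with N = len(probs)."""
    n = len(probs)
    return sum(p * eps / (math.sqrt(n) * math.sqrt(p)) for p in probs if p > 0)


if __name__ == "__main__":
    skewed = [0.5, 0.25, 0.125, 0.125]
    uniform = [0.25] * 4
    print(weighted_kl_budget(skewed, 0.1), weighted_kl_budget(uniform, 0.1))
```

The uniform distribution attains the bound with equality, so the per-instantiation accuracy $\varepsilon / (v^{k/2} \sqrt{P(X_{1:k} = u)})$ cannot be loosened by a constant factor without changing the argument.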


Appendix F. Proofs for Section 4 


In this section we give formal proofs of all theorems, propositions and lemmas appearing in Section 4.


F.1 Proof of Lemma 12 


We first prove the following lemma. 


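Lemma 29 Let any $\varepsilon > 0$, $\delta > 0$ be given. Let any $\lambda \in (0,1)$ be given. Let $\{X_i\}_{i=1}^m$ be i.i.d.
Bernoulli($\phi$) random variables, where $\lambda \le \phi \le 1 - \lambda$. Let $\hat{\phi} = \frac{1}{m} \sum_{i=1}^m X_i$. Then for

\[ |\phi \log \phi - \hat{\phi} \log \hat{\phi}| \le \varepsilon \]

to hold w.p. $1 - \delta$, it suffices that

\[ m \ge \max\left\{ \frac{2}{\lambda^2} \log \frac{2}{\delta},\; \frac{2}{\varepsilon^2 \lambda^2} \log \frac{2}{\delta} \right\}. \]
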
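Proof From the Hoeffding inequality we have that for

\[ |\phi - \hat{\phi}| \le \varepsilon' \]

to hold w.p. $1 - \delta$ it suffices that

\[ m \ge \frac{1}{2\varepsilon'^2} \log \frac{2}{\delta}. \tag{60} \]

Now since the function $f(x) = x \log x$ is Lipschitz with Lipschitz-constant smaller than $\max\{1, |\log(\lambda - \varepsilon')|\}$
over the interval $[\lambda - \varepsilon', 1]$, we have that for

\[ |\phi \log \phi - \hat{\phi} \log \hat{\phi}| \le \varepsilon' \max\{1, |\log(\lambda - \varepsilon')|\} \]

to hold w.p. $1 - \delta$, it suffices that $m$ satisfies Eqn. (60). If we choose $\varepsilon'$ such that $\varepsilon' \le \lambda/2$ we get

\[ |\phi \log \phi - \hat{\phi} \log \hat{\phi}| \le \varepsilon' \max\{1, |\log(\lambda/2)|\}. \]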


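To ensure the right hand side is smaller than $\varepsilon$, it suffices that the following three conditions are
satisfied:

\[
\varepsilon' \le \frac{\lambda}{2}, \qquad
\varepsilon' \le \varepsilon, \qquad
\varepsilon' \le \frac{\varepsilon \lambda}{2} \le \frac{\varepsilon}{|\log \frac{\lambda}{2}|}.
\]

The last inequality holds since $\lambda \in (0,1)$. Since $\lambda \in (0,1)$ we can simplify this to the following two
conditions:

\[
\varepsilon' \le \frac{\lambda}{2}, \qquad
\varepsilon' \le \frac{\varepsilon \lambda}{2}.
\]

Substituting this into Eqn. (60) gives us the condition for $m$ as in the statement of the lemma. □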
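The substitution in this proof can be traced in a few lines. The sketch below assumes Eqn. (60) is the two-sided Hoeffding condition $m \ge \frac{1}{2\varepsilon'^2} \log \frac{2}{\delta}$ and that $\varepsilon'$ is set to the smaller of $\lambda/2$ and $\varepsilon\lambda/2$; the function names are hypothetical.

```python
import math


def hoeffding_samples(eps_prime, delta):
    """Smallest integer m with m >= log(2 / delta) / (2 * eps_prime**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps_prime**2))


def lemma29_samples(eps, delta, lam):
    """Sample size obtained from eps' = min(lam / 2, eps * lam / 2)."""
    eps_prime = min(lam / 2.0, eps * lam / 2.0)
    return hoeffding_samples(eps_prime, delta)


def xlogx(x):
    """f(x) = x log x, extended by continuity with f(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


if __name__ == "__main__":
    print(lemma29_samples(eps=0.1, delta=0.05, lam=0.2))
    # Lipschitz step: a deviation of eps' = 0.01 at phi = lam = 0.2.
    print(abs(xlogx(0.2) - xlogx(0.19)), "<=", 0.01 * abs(math.log(0.1)))
```

Taking the minimum of the two admissible values of $\varepsilon'$ is equivalent to taking the maximum of the two sample-size conditions, which is how the $\max\{\cdot,\cdot\}$ in the lemma arises.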


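Proof [Lemma 12] We abbreviate $P(X = x, Y = y)$ as $P(x,y)$ and similarly for $\hat{P}$, $x$, $y$, $x|y$. We
abbreviate $\sum_{x \in \mathrm{val}(X)}$ by $\sum_x$ and similarly for $y$.

\[
\begin{aligned}
|H(X|Y) - \hat{H}(X|Y)| &= \Big| \sum_{x,y} P(x,y) \log P(x|y) - \sum_{x,y} \hat{P}(x,y) \log \hat{P}(x|y) \Big| \\
&= \Big| \sum_{x,y} P(x,y) \log P(x,y) - \sum_y P(y) \log P(y) \\
&\qquad - \sum_{x,y} \hat{P}(x,y) \log \hat{P}(x,y) + \sum_y \hat{P}(y) \log \hat{P}(y) \Big| \\
&\le \sum_{x,y} \big| P(x,y) \log P(x,y) - \hat{P}(x,y) \log \hat{P}(x,y) \big| \\
&\qquad + \sum_y \big| P(y) \log P(y) - \hat{P}(y) \log \hat{P}(y) \big|.
\end{aligned}
\]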





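Now using Lemma 29 (and the Union bound) we get that for

\[ |H(X|Y) - \hat{H}(X|Y)| \le |\mathrm{val}(X)||\mathrm{val}(Y)|\varepsilon' + |\mathrm{val}(Y)|\varepsilon' \]

to hold w.p. $1 - |\mathrm{val}(X)||\mathrm{val}(Y)|\delta' - |\mathrm{val}(Y)|\delta'$, it suffices that

\[ m \ge \max\left\{ \frac{2}{\lambda^2} \log \frac{2}{\delta'},\; \frac{2}{\varepsilon'^2 \lambda^2} \log \frac{2}{\delta'} \right\}. \]

Choosing $\varepsilon' = \varepsilon/(2|\mathrm{val}(X)||\mathrm{val}(Y)|)$ and $\delta' = \delta/(2|\mathrm{val}(X)||\mathrm{val}(Y)|)$ gives that for

\[ |H(X|Y) - \hat{H}(X|Y)| \le \varepsilon \]

to hold with probability $1 - \delta$, it suffices to have

\[ m \ge \max\left\{ \frac{8|\mathrm{val}(X)|^2|\mathrm{val}(Y)|^2}{\varepsilon^2 \lambda^2} \log \frac{4|\mathrm{val}(X)||\mathrm{val}(Y)|}{\delta},\; \frac{2}{\lambda^2} \log \frac{4|\mathrm{val}(X)||\mathrm{val}(Y)|}{\delta} \right\}. \tag{61} \]

Now since for any two distributions $P$ and $\hat{P}$ we have $|H(X|Y) - \hat{H}(X|Y)| \le \log|\mathrm{val}(X)|
\le 2|\mathrm{val}(X)||\mathrm{val}(Y)|$, we have that for any $\varepsilon > 2|\mathrm{val}(X)||\mathrm{val}(Y)|$ the statement of the lemma holds
trivially independent of the number of samples $m$. For $\varepsilon \le 2|\mathrm{val}(X)||\mathrm{val}(Y)|$ the first term of the
maximum dominates the second. Thus we can simplify the conditions on $m$ in
Eqn. (61) to one condition:

\[ m \ge \frac{8|\mathrm{val}(X)|^2|\mathrm{val}(Y)|^2}{\varepsilon^2 \lambda^2} \log \frac{4|\mathrm{val}(X)||\mathrm{val}(Y)|}{\delta}, \]

which proves the lemma. □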
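To make the quantities in Lemma 12 concrete, the sketch below computes a plug-in estimate of $H(X|Y)$ from samples and evaluates the final sample-size condition, read here as $m \ge \frac{8|\mathrm{val}(X)|^2|\mathrm{val}(Y)|^2}{\varepsilon^2\lambda^2} \log \frac{4|\mathrm{val}(X)||\mathrm{val}(Y)|}{\delta}$. That reading of the formula, and all function names, are assumptions made for illustration.

```python
import math
from collections import Counter


def conditional_entropy(joint):
    """H(X|Y) in nats from a dict {(x, y): probability}."""
    marg_y = Counter()
    for (x, y), p in joint.items():
        marg_y[y] += p
    return -sum(p * math.log(p / marg_y[y]) for (x, y), p in joint.items() if p > 0)


def empirical_joint(samples):
    """Empirical joint distribution from a list of (x, y) pairs."""
    counts = Counter(samples)
    n = len(samples)
    return {xy: c / n for xy, c in counts.items()}


def lemma12_samples(eps, delta, lam, nx, ny):
    """Sample size for accuracy eps; nx = |val(X)|, ny = |val(Y)|."""
    return math.ceil(8 * nx**2 * ny**2 / (eps**2 * lam**2) * math.log(4 * nx * ny / delta))


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(conditional_entropy(empirical_joint(data)))  # entropy of a fair bit given Y
    print(lemma12_samples(eps=0.1, delta=0.05, lam=0.25, nx=2, ny=2))
```

The $1/\lambda^2$ dependence makes the condition grow quickly when some joint outcome is rare, which is the price of bounding every term $p \log p$ uniformly.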


F.2 Proof of Lemma 13 


We abbreviate $P(X = x)$ as $P(x)$ and similarly for other variables.
Proof [Lemma 13] Using Eqn. (14) and the definition of conditional entropy we get that

\[ \sum_{x,u,v,w,y} P(x,u,v,w,y) \log P(x|u,v,w,y) - \sum_{x,u,w} P(x,u,w) \log P(x|u,w) \le \varepsilon. \]

We can rewrite this as

\[ \sum_{x,u,v,w,y} P(x,u,v,w,y) \log \frac{P(x|u,v,w,y)}{P(x|u,w)} \le \varepsilon. \]

Now using Eqn. (13) ($U \cup V$ is the Markov blanket of $X$) gives us

\[ \sum_{x,u,v,w,y} P(x,u,v,w,y) \log \frac{P(x|u,v)}{P(x|u,w)} \le \varepsilon. \]

We can simplify this to

\[ \sum_{x,u,v,w} P(x,u,v,w) \log \frac{P(x|u,v)}{P(x|u,w)} \le \varepsilon. \]

Using the definition of conditional probability and Eqn. (13) ($U \cup V$ is the Markov blanket of $X$) we
get

\[ \sum_{u,v,w} P(u,v,w) \sum_x P(x|u,v) \log \frac{P(x|u,v)}{P(x|u,w)} \le \varepsilon. \]

Now since $\lambda_1 \le P(u,v,w)$ and each term $\sum_x P(x|u,v) \log \frac{P(x|u,v)}{P(x|u,w)}$ is nonnegative (it's a KL-divergence)
we get that for all $u, v, w$

\[ \sum_x P(x|u,v) \log \frac{P(x|u,v)}{P(x|u,w)} \le \frac{\varepsilon}{\lambda_1}. \]
The left hand side of this equation is the KL-divergence between the distribution $P(X \mid U = u, V = v)$
and the distribution $P(X \mid U = u, W = w)$. Now using the KL-divergence property
that $\frac{1}{2} \big( \sum_x |P_1(x) - P_2(x)| \big)^2 \le D(P_1 \| P_2)$ (see, for example, Cover and Thomas, 1991, p. 300),
we get that for all $u, v, w$

\[ \Big( \sum_x |P(x|u,v) - P(x|u,w)| \Big)^2 \le \frac{2\varepsilon}{\lambda_1}. \]

As a consequence, we have for all $x, u, v, w$ that

\[ |P(x|u,v) - P(x|u,w)| \le \frac{\sqrt{2\varepsilon}}{\sqrt{\lambda_1}}. \]

Now since $\lambda_2 \le P(x|u,v)$ and $\lambda_2 \le P(x|u,w)$ we have that

\[ |\log P(x|u,v) - \log P(x|u,w)| \le \frac{\sqrt{2\varepsilon}}{\lambda_2 \sqrt{\lambda_1}}. \]

Now using Eqn. (13) ($U \cup V$ is the Markov blanket of $X$) to substitute $P(x|u,v)$ by $P(x|u,v,w,y)$,
we obtain Eqn. (15). □
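The last two steps of this proof (from a KL bound to an $L_1$ bound, then to a bound on log-probabilities) can be checked on a small example. The sketch assumes the natural-log form of the inequality, $\frac{1}{2}\|P_1 - P_2\|_1^2 \le D(P_1 \| P_2)$, and uses $\lambda_2$ for the smallest probability in either distribution; the function names are illustrative.

```python
import math


def kl(p, q):
    """D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def l1(p, q):
    """L1 distance between two distributions on the same outcomes."""
    return sum(abs(a - b) for a, b in zip(p, q))


def max_log_gap(p, q):
    """Largest |log p(x) - log q(x)| over all outcomes x."""
    return max(abs(math.log(a) - math.log(b)) for a, b in zip(p, q))


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    lam2 = min(p + q)
    print(0.5 * l1(p, q) ** 2, "<=", kl(p, q))
    print(max_log_gap(p, q), "<=", math.sqrt(2 * kl(p, q)) / lam2)
```

The division by $\lambda_2$ comes from the mean value theorem for $\log$: on $[\lambda_2, 1]$ its derivative is at most $1/\lambda_2$.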


F.3 Proof of Theorem 14 


Proof [Theorem 14] There are $O(k n^k b n^b)$ (candidate factor, candidate Markov blanket) pairs, each
with $O(v^{k+b})$ different instantiations. Collecting the required empirical probabilities from the data
takes $O(k n^k b n^b v^{k+b} + m k n^k b n^b (k+b))$. (Similar reasoning as in the proof of Theorem 5.) Computing
the empirical entropies from the empirical probabilities takes $O(k n^k b n^b v^{k+b})$. There are $O(k n^k)$
actual factors computed. From (the proof of) Theorem 5, we have that this takes $O(k n^k (m(k+b) + 2^k v^k))$.
Putting it all together gives us an upper bound on the running time of

\[ O\big(k n^k b n^b v^{k+b} + m k n^k b n^b (k+b) + k n^k b n^b v^{k+b} + k n^k (m(k+b) + 2^k v^k)\big). \]

After simplification we get a running time of

\[ O\big(k n^k b n^b v^{k+b} + m k n^k b n^b (k+b) + k n^k 2^k v^k\big). \] □
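The simplification at the end of this proof drops terms that are dominated by others. The sketch below evaluates the individual cost terms for concrete values and compares their sum with the simplified expression, taking the terms to be $k n^k b n^b v^{k+b}$, $m\,k n^k b n^b (k+b)$ and $k n^k (m(k+b) + 2^k v^k)$ as read from the proof. Constants hidden by the $O(\cdot)$ notation are ignored, and the function names are hypothetical.

```python
def cost_terms(n, k, b, v, m):
    """Operation counts for the three stages, up to constant factors."""
    pairs = k * n**k * b * n**b  # (candidate factor, candidate Markov blanket) pairs
    counting = pairs * v ** (k + b) + m * pairs * (k + b)
    entropies = pairs * v ** (k + b)
    factors = k * n**k * (m * (k + b) + 2**k * v**k)
    return counting, entropies, factors


def simplified_cost(n, k, b, v, m):
    """The simplified expression with duplicated and dominated terms removed."""
    pairs = k * n**k * b * n**b
    return pairs * v ** (k + b) + m * pairs * (k + b) + k * n**k * 2**k * v**k


if __name__ == "__main__":
    args = dict(n=10, k=2, b=3, v=2, m=1000)
    print(sum(cost_terms(**args)), simplified_cost(**args))
```

Since $b n^b \ge 1$, the term $k n^k m (k+b)$ is at most $m\,k n^k b n^b (k+b)$, so the full sum is within a factor of two of the simplified expression.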


F.4 Proof of Theorem 15 
Proof [Theorem 15] Let $\mathcal{C}$, $\mathcal{Y}$ be defined as in Eqn. (16) and Eqn. (17) of the structure learning
algorithm description. For all $C_j^* \in \mathcal{C}$ we have by assumption $|\mathrm{val}(C_j^*)| \le v^k$. For all $Y \in \mathcal{Y}$ we
have $|\mathrm{val}(Y)| \le v^b$. Also note that $P(C_j^* = c, Y = y) \ge \gamma^{k+b}$. Using Lemma 12 we get that for any
$C_j^* \in \mathcal{C}$, $Y \in \mathcal{Y}$ for

\[ |H(C_j^*|Y) - \hat{H}(C_j^*|Y)| \le \varepsilon' \tag{62} \]

to hold with probability $1 - \delta'$ it suffices that

\[ m \ge 8 \frac{v^{2k+2b}}{\gamma^{2k+2b} \varepsilon'^2} \log \frac{4 v^{k+b}}{\delta'}. \tag{63} \]

Taking the union bound we get that for Eqn. (62) to hold for all $C_j^* \in \mathcal{C}$ and for all $Y \in \mathcal{Y}$ with
probability $1 - |\mathcal{C}||\mathcal{Y}|\delta'$ it suffices that $m$ satisfies Eqn. (63).

For $\widehat{MB}(C_j^*) = \arg\min_{Y \in \mathcal{Y},\, Y \cap C_j^* = \emptyset} \hat{H}(C_j^*|Y)$ we have $\hat{H}(C_j^*|\widehat{MB}(C_j^*)) \le \hat{H}(C_j^*|MB(C_j^*))$. Combining
this with Eqn. (62) gives us

\[ H(C_j^*|\widehat{MB}(C_j^*)) \le H(C_j^*|MB(C_j^*)) + 2\varepsilon'. \tag{64} \]


From Lemma 13 we have that Eqn. (64) implies that 








* ATID * * * 4e’ 
2Ve 
meen (65) 
yer 
Now from Lemma 18 we have that for 
*IMAD * D Ekra * 2y £ 
|log P(C;|MB(C;)) — log P(C;|MB(C;))| < y (66) 
to hold for all instantiations c} € val(C¥) with probability 1 — 6”, it suffices that 
(Lt gs)? aye (67) 


T 2y2k+2b( af 2 ar 


Using the Union bound, we get that for Eqn. (66) to hold for all C} € C with probability 1 — |c|8", 
it suffices that m satisfies Eqn. (67). Or after simplification (and slightly loosening using y < 1), we 
get the condition 





(1+2ve)? 4vkt 
mz QyPk+ 2be! 0g a” $ 


Combining Eqn. (66) and Eqn. (65) gives us 


(68) 


g' 
|logP(C}|X — Cj) — log Ê(C*|MB(C*))| < s 
From Lemma 19 we have that this implies 
* k vel 
[log féx- c;(e Cy log f* C;|MB(C* c(i) <2 KET 
Now choosing £’ = EE gives us that 
€ 


[log féx- c;(¢ cj) — log f* -rc (C7) < Ak+2° 
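The clipping to one of factor entries $\hat{f}^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)$ satisfying $|\log \hat{f}^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)| \le \frac{\varepsilon}{2^{k+2}}$ introduces
at most an additional error of $\frac{\varepsilon}{2^{k+2}}$. Thus after the clipping we have

\[ |\log f^*_{C_j^*|\mathcal{X} - C_j^*}(c_j^*) - \log \hat{f}^*_{C_j^*|\widehat{MB}(C_j^*)}(c_j^*)| \le \frac{\varepsilon}{2^{k+1}} \tag{69} \]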


for all canonical factors of P. We also have that all candidate factors that are not present in the 
canonical form of the true distribution P will now have been removed and do not contribute to 
P. (By our assumption on b the algorithm considered large enough Markov blanket candidates to 
include the true Markov blanket. Such a large enough b for these factors (which can be larger than 
the maximum Markov blanket size for factors present in the distribution) is important. Trivial (all- 
ones) canonical factors computed using their Markov blanket require a true Markov blanket to be 
all-ones.) 

So far we have shown that Eqn. (69) holds with probability $1 - |\mathcal{C}||\mathcal{Y}|\delta' - |\mathcal{C}|\delta''$ if $m$ satisfies
both Eqn. (63) and Eqn. (68). Or, after substituting in the choice of $\varepsilon'$, if the following hold


g8k+19,, 2k+2b 4y k+b 





m: `z yt 6b eA log y” 
+b 
m > r LE2RR) pg ah 





S yik tbe? og 8! 


So choosing $\delta' = \delta'' = \delta/(2|\mathcal{C}||\mathcal{Y}|)$, we have that for Eqn. (69) to hold with probability $1 - \delta$, it
suffices that
et? ytkt2b28e+19 Mabaa 
m2 (1+ a3 yb minde? e} © 5 








Now using the fact that $|\mathcal{C}| \le k n^k$ and $|\mathcal{Y}| \le b n^b$ we obtain the following result: with probability
$1 - \delta$, Eqn. (69) holds for all non-trivial canonical factors in the target distribution if $m$ satisfies the
condition on $m$ in the theorem, namely Eqn. (19). Moreover (recall the clipping procedure removed
all candidate factors with scope less than $k$ and Markov blanket size less than $b$ that are not present
in the canonical form of the true distribution $P$) we have that zero error is incurred on all other
factors. Thus (after using Lemma 20) we have that

\[ D(P \| \hat{P}) + D(\hat{P} \| P) \le 2 J^* \frac{\varepsilon}{2^{k+1}} \le J \varepsilon. \]

The second inequality follows since $J^* \le 2^k J$. □


F.5 Proof of Theorem 16


Proof [Theorem 16] From Proposition 4 we have that 
1 K * 
z I I; mew; (87) 
We can rewrite this product as follows:


oax) =% I] fö meow:)(d}) I] fme) (45) 
j:|D*|<k,MB(D*)| <b j:[D*|<k,|MB(D%)|>b 
I] meod) (70) 
J:|D}|>k 


The learned distribution P = 4 H =| f CMB cS c7), can be rewritten as 


P(x) = Fomm (43) I Fome) (43) 

J:|D}|<k,|MB (D})|<b J:|D}|<k,|MB (D})|>b 
Fome) F omele). 

iC} {D} Jx: |MB(C})| <b J:C74 {D} jk: |MB(C})|>b 


N| — 


(71) 


Using Eqn. (70), Eqn. (71), the fact that for all Cj ¢ {D} }% we have that the canonical fac- 
tor is trivial, namely log fémes (Cj c*) = 0 (and adding and subtracting the same terms) we get: 


D(Q||P) + D(P||Q) = 











Jomo (Ù) Jome) (45) 
Eo E p r)a, E op a) (72) 
[Dyk f'omo) „ (Dy <k f'omo; (4) 
Ë IMB(D$)| <b Ë IMB(D')| < <b 
“smp (d) Fac (G) 
D;|MB(D*) j piney T 
+Ex~o( Y logg e. i T ) Expl Y bgp Lal i rT (73) 
[yl <4 Fomo) „ [Dyk f'omo) 
` [MB(D;)| >b ` [MB(D;)| >b 
Sp. ;|MB(D p) (d}) f smB(D) ($) 
+Exo( E bs ep a) -Exel E g ep a) a 
„ (Disk D;|MB(D3) „ [Disk D;|MB(D;) ““ 
IMB(D')| > b IMB(D')| > b 
+Ex.o( L log fp: mB(D5) (d} ))- Expl 2 log förmer) (d})) (75) 
J:|D}|>k j:|D}|>k 
-Exo( E efom) +Exe( E gfo) 78) 
3 C} ¢ {D} tk x C} ¢ {Dube 
` [MB(C$)| < b ` [MB(C$)| < b 
Facsen (È) Tasca) 
C/MB(Cs) J eien lG 
Hol eg) a N e 
SD amot EDD amot 
IMB(C3)| >b IMB(C5)| >b 
S&mec (È) SeimB(cr) G) 
+Ex~o( L log eee) — Ex~p( L log TA =) (78) 
E Ci ¢ {D} Se C;|MB(C;) ` / Ci ¢ {Di te C;|MB(C;) J 
` |MB(C%)| >b ` MB(G;)| >b 
pigs 5 tle 5. (79) 
Using the same reasoning as in the proof of Theorem 15 we obtain that for the sum of the terms
in lines (72), (73), (76) and (77) to be bounded by $(J + |S|)\varepsilon$ with probability at least $1 - \delta$, it suffices
that $m$ satisfies Eqn. (19). The additional term in the bound, namely $|S|\varepsilon$, is necessary to bound the
error contribution of the terms in line (77).

The sum of the terms in lines (74) and (78) can be bounded by 


fém) 
C3) 





2 oy MAX ¢* log = 
C3EC : |MB(C*)|>5 fe mcl 


The sum of the terms in line (75) can be bounded by (recall MB(-) is the true Markov blanket 
for the true distribution Q, thus fj. (dj) = frime (4 i) 
J J J 





2 Ł maxa; log fp; (dj) |. 
J:|D5|>k 
The two terms in line (79) sum to zero. 
This establishes the theorem. □